TY - JOUR
T1 - Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows
AU - Ferreira da Silva, Rafael
AU - Filgueira, Rosa
AU - Deelman, Ewa
AU - Pairo-Castineira, Erola
AU - Overton, Ian Michael
AU - Atkinson, Malcolm
N1 - 11th Workshop on Workflows in Support of Large-Scale Science 2016, WORKS 2016 ; Conference date: 14-11-2016 Through 14-11-2016
PY - 2017/2/28
Y1 - 2017/2/28
N2 - Scientific workflows have become mainstream for conductinglarge-scale scientific research. As a result, many workflowapplications and Workflow Management Systems (WMSs)have been developed as part of the cyberinfrastructure toallow scientists to execute their applications seamlessly ona range of distributed platforms. In spite of many successstories, a key challenge for running workflows in distributedsystems is failure prediction, detection, and recovery. Inthis paper, we propose an approach to use control theorydeveloped as part of autonomic computing to predict failures before they happen, and mitigated them when possible.The proposed approach applying the proportional-integralderivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, tomitigate faults by adjusting the inputs of the controller. ThePID controller aims at detecting the possibility of a fault farenough in advance so that an action can be performed toprevent it from happening. To demonstrate the feasibility ofthe approach, we tackle two common execution faults of theBig Data era—data storage overload and memory overflow.We define, implement, and evaluate simple PID controllersto autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB ofdata, and requires over 24TB of memory to run all tasksconcurrently. Experimental results indicate that workflowexecutions may significantly benefit from PID controllers,in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdownof 1.01) can be attained when using our proposed method,and faults are detected and mitigated far in advance of theiroccurence.
AB - Scientific workflows have become mainstream for conductinglarge-scale scientific research. As a result, many workflowapplications and Workflow Management Systems (WMSs)have been developed as part of the cyberinfrastructure toallow scientists to execute their applications seamlessly ona range of distributed platforms. In spite of many successstories, a key challenge for running workflows in distributedsystems is failure prediction, detection, and recovery. Inthis paper, we propose an approach to use control theorydeveloped as part of autonomic computing to predict failures before they happen, and mitigated them when possible.The proposed approach applying the proportional-integralderivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, tomitigate faults by adjusting the inputs of the controller. ThePID controller aims at detecting the possibility of a fault farenough in advance so that an action can be performed toprevent it from happening. To demonstrate the feasibility ofthe approach, we tackle two common execution faults of theBig Data era—data storage overload and memory overflow.We define, implement, and evaluate simple PID controllersto autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB ofdata, and requires over 24TB of memory to run all tasksconcurrently. Experimental results indicate that workflowexecutions may significantly benefit from PID controllers,in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdownof 1.01) can be attained when using our proposed method,and faults are detected and mitigated far in advance of theiroccurence.
M3 - Article
SN - 1613-0073
VL - 1800
SP - 15
EP - 24
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
ER -