Boosting adaptivity of fault-tolerant scheduling for real-time tasks with service requirements on clusters

Xiaomin Zhu, Chuan He, Rong Ge, Peizhong Lu

doi:10.1016/j.jss.2011.04.067

Abstract

Thanks to their excellent extensibility and usability, computer clusters have become the dominant platform for parallel computing. Fault tolerance is mandatory for safety-critical applications running on clusters. In this paper we propose a service-aware and adaptive fault-tolerant scheduling algorithm using overlapping technologies (SAO for short) that can tolerate a node’s permanent failure at any time instant for real-time tasks with service requirements in heterogeneous clusters. SAO adopts the primary/backup model and considers timing constraints, service requirements, and system resource utilization. To improve system resource utilization, we employ backup-backup (BB for short) and primary-backup (PB for short) overlapping technologies and analyze the overlapping constraints. In addition, SAO achieves high system adaptivity by dynamically adjusting the service levels of tasks based on system load. Furthermore, to improve resource utilization and schedulability, SAO either makes backup copies adopt the passive execution scheme or reduces the overlapping execution time of the primary copy and backup copy of a task as much as possible. In simulation experiments against a baseline algorithm SAWO (a service-aware and adaptive fault-tolerant scheduling algorithm without overlapping technologies) and an existing algorithm DYFARS, SAO achieves an average improvement of 51.25% in performability.
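
The idea behind BB overlapping can be illustrated with a minimal sketch. It assumes the rule commonly used in primary/backup scheduling under a single-node-failure model: two backup copies may share a time slot on one node only if their primary copies are placed on different nodes, so that at most one of the two backups ever has to run. The names `Copy`, `overlaps_in_time`, and `bb_overlap_allowed` are hypothetical, and the sketch does not reproduce the constraints analyzed in this paper (heterogeneity, service levels, PB overlapping).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One scheduled copy (primary or backup) of a task. Hypothetical type."""
    task: str
    node: int
    start: float
    finish: float


def overlaps_in_time(a: Copy, b: Copy) -> bool:
    # Two half-open intervals [start, finish) intersect.
    return a.start < b.finish and b.start < a.finish


def bb_overlap_allowed(primary_a: Copy, backup_a: Copy,
                       primary_b: Copy, backup_b: Copy) -> bool:
    """Assumed BB rule under a single-node-failure model."""
    shares_slot = (backup_a.node == backup_b.node
                   and overlaps_in_time(backup_a, backup_b))
    if not shares_slot:
        return True  # the backups do not overlap, so there is nothing to check
    # If both primaries sat on the same node, one failure would activate
    # both backups at once and they would collide in the shared slot.
    return primary_a.node != primary_b.node


p1, b1 = Copy("t1", 0, 0, 4), Copy("t1", 2, 5, 9)
p2, b2 = Copy("t2", 1, 1, 5), Copy("t2", 2, 6, 10)
p3, b3 = Copy("t3", 0, 4, 6), Copy("t3", 2, 7, 9)

print(bb_overlap_allowed(p1, b1, p2, b2))  # primaries on different nodes
print(bb_overlap_allowed(p1, b1, p3, b3))  # primaries on the same node
```

Under these assumptions the first pair of backups may share the slot on node 2, while the second pair may not, because a failure of node 0 would require both backups to execute.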

Full Text