An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems

Xiaoyong Tang,Guiping Liao,Kenli Li

doi:10.1007/s10586-014-0372-1

Abstract

In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can have an adverse effect on applications executing on such systems. One way of taking failures into account is to employ a reliable scheduling algorithm. However, most existing scheduling algorithms for precedence constrained tasks in heterogeneous systems only consider scheduling length, and not efficiently satisfy the reliability requirements of task. In recognition of this problem, we build an application reliability analysis model based on Weibull distribution, which can dynamically measure the reliability of task executing on heterogeneous cluster with arbitrary networks architectures. Then, we propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD) which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, it duplicates task as if task hazard rate is more than threshold $$\theta $$ ? . The comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.

Full Text