Fault-tolerant scheduling on parallel systems with non-memoryless failure distributions

Mohamed Slim Bouguerra, Derrick Kondo, Fernando Mendonca, Denis Trystram

doi:10.1016/j.jpdc.2014.01.005

Abstract

As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. In real systems, failure rates most often increase or decrease over time. Considering non-memoryless failure distributions, we study a bi-objective scheduling problem of optimizing application makespan and reliability. In particular, we determine whether one can optimize both makespan and reliability simultaneously, or whether one metric must be degraded in order to improve the other. We also devise scheduling algorithms for achieving (approximately) optimal makespan or reliability. When failure rates decrease, we prove that makespan and reliability are opposing metrics. In contrast, when failure rates increase, we prove that one can optimize both makespan and reliability simultaneously. Moreover, we show that the largest processing time (LPT) list scheduling algorithm achieves good performance when processors are of uniform speed. These findings imply faster completion and improved reliability for parallel jobs executed across large distributed systems. Finally, we conduct simulations, based on an actual biological sequence comparison application, to investigate the impact of failures on performance.

Full Text