Abstract

In a heterogeneous distributed computing environment, developing a fault tolerance mechanism is a key research issue. Most existing fault tolerance approaches for distributed computing environments are post-active, that is, they act only after a fault has occurred. These post-active approaches predominantly rely on the heartbeat strategy for fault detection and the checkpointing mechanism for fault recovery. In this work, a proactive Health Aware Fault Tolerant (HAFT) scheduler based on the Cox Proportional Hazard survival probability model is developed. The survival probability of a resource is estimated using resource failure data analytics and is termed the health coefficient of the resource. For the job distribution classes jclass1, jclass2, and jclass3, the average makespan improvement of the HAFT algorithm over the compared algorithms is 44%, 59.6%, and 26.4%, respectively. In a heterogeneous environment, the job failure rate of the HAFT scheduler ranges between 15% and 20% and is stable across all three job classes. In a homogeneous environment, the job failure rate of the HAFT algorithm is lower than that of the IRP, REP, and MJSP algorithms by 58.6%, 26.4%, and 11.6%, respectively. For a failure probability higher than 0.4, the resource efficiency of the HAFT algorithm is on average 26% higher than that of MJSP and 53% higher than that of REP.
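
The health coefficient described above can be illustrated with the standard Cox Proportional Hazard survival formula, S(t | x) = S0(t)^exp(β·x), where S0(t) is the baseline survival probability at time t, x is a vector of resource covariates, and β is the vector of fitted coefficients. The sketch below is a minimal illustration under stated assumptions, not the HAFT implementation: the covariates, the coefficient values, the function names `health_coefficient` and `pick_healthiest`, and the rule of choosing the resource with the highest coefficient are all hypothetical and do not come from the abstract.

```python
import math


def health_coefficient(baseline_survival, coefficients, covariates):
    """Cox survival probability S(t|x) = S0(t) ** exp(beta . x).

    baseline_survival: S0(t), assumed to be already estimated from failure data.
    coefficients:      fitted Cox coefficients (hypothetical values in this sketch).
    covariates:        per-resource covariate values (hypothetical features).
    """
    if not 0.0 <= baseline_survival <= 1.0:
        raise ValueError("baseline survival must be a probability in [0, 1]")
    if len(coefficients) != len(covariates):
        raise ValueError("coefficients and covariates must have the same length")
    # Linear predictor beta . x; a larger value means a higher hazard.
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_survival ** math.exp(linear_predictor)


def pick_healthiest(resources, baseline_survival, coefficients):
    """Return the name of the resource with the highest health coefficient.

    This selection rule is an assumption made for illustration only.
    """
    return max(
        resources,
        key=lambda name: health_coefficient(
            baseline_survival, coefficients, resources[name]
        ),
    )


if __name__ == "__main__":
    # Hypothetical covariates: (past failure count, normalized load).
    resources = {"node_a": [3.0, 0.9], "node_b": [0.0, 0.2]}
    beta = [0.4, 0.5]  # hypothetical fitted coefficients
    for name, x in resources.items():
        print(name, round(health_coefficient(0.9, beta, x), 4))
    print("selected:", pick_healthiest(resources, 0.9, beta))
```

Under this model, a resource with a higher linear predictor has a lower survival probability, so a scheduler that favors high health coefficients steers jobs away from resources with a history of failures.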
