Load balancing is often overlooked when implementing fault tolerance in grid computing. Effective load balancing assigns each resource a fair share of the load based on its fitness, rather than assigning the majority of tasks to the fittest resources. Proper load balancing in a fault-tolerant system also reduces bottlenecks at the fittest resources and increases the utilization of the others. This paper presents a fault tolerance algorithm based on the ant colony system that balances load according to resource fitness and uses resubmission and checkpointing techniques to improve fault tolerance in grid computing. Experimental results indicate that the proposed algorithm achieves better execution time, throughput, makespan, latency, load balancing, and success rate.
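
The idea of sharing load by fitness, and of resuming from a checkpoint on resubmission, can be illustrated with a minimal Python sketch. This is not the ant colony system algorithm itself, whose pheromone and selection rules are not given in this abstract. The function names, the fitness scores, and the fixed checkpoint interval are all assumptions made for illustration.

```python
def proportional_shares(fitness, num_tasks):
    """Split num_tasks across resources in proportion to their fitness.

    Hypothetical helper: `fitness` maps a resource name to a positive score.
    Uses the largest-remainder method so the shares sum to num_tasks.
    """
    total = sum(fitness.values())
    exact = {r: num_tasks * f / total for r, f in fitness.items()}
    shares = {r: int(x) for r, x in exact.items()}  # round each share down
    leftover = num_tasks - sum(shares.values())
    # Hand the remaining tasks to the resources with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda r: exact[r] - shares[r], reverse=True)
    for r in by_remainder[:leftover]:
        shares[r] += 1
    return shares


def resume_point(progress_at_failure, checkpoint_interval):
    """Work units preserved by the last checkpoint taken before a failure.

    Assumes checkpoints are taken every `checkpoint_interval` work units,
    so a resubmitted task restarts from this point rather than from zero.
    """
    return (progress_at_failure // checkpoint_interval) * checkpoint_interval


if __name__ == "__main__":
    scores = {"A": 5, "B": 3, "C": 2}  # assumed fitness scores
    print(proportional_shares(scores, 10))
    print(resume_point(47, 10))
```

With these assumed scores, the fittest resource still receives the largest share, but the other resources are not left idle, which is the balance the abstract describes.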