A Survey on Fault Tolerance Mechanisms for Job Scheduling in Grid Computing

S. Supriya, S. Dinesh Babu

doi:10.9790/0661-1623120122

Abstract

Grid computing is defined as a hardware and software infrastructure that enables coordinated sharing of resources in a dynamic environment. In grid computing, the probability of a failure is much greater than in parallel computing. Fault tolerance is therefore an important issue in achieving reliability and availability of resources. When scheduling a job, the scheduler uses both the average failure time and the failure rate of grid resources, combined with resource response time, to generate a schedule. Execution can fail for several reasons, such as network failure, resource overloading, or non-availability of required software components. Thus, fault-tolerant systems should be able to identify and rectify failures and support reliable execution in the presence of failures. This paper surveys various fault tolerance techniques and mechanisms, as well as job management, in grid computing.
