is a form of distributed computing used mainly to virtualize and utilize geographically distributed idle resources. A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in the loss and delay of executing jobs. To ensure good performance, fault tolerance should be taken into account. Here we address fault tolerance in terms of resource failure. A commonly used technique for achieving fault tolerance is periodic checkpointing, which periodically saves the job's state. However, an inappropriate checkpointing interval delays job execution and reduces throughput. Hence, in the proposed work, fault tolerance is achieved by dynamically adapting the checkpoints based on the current status of the resource and its failure history, which is maintained in the Information server. The Last failure time and Mean failure time based algorithm dynamically modifies the checkpoint interval and thereby increases throughput by reducing unnecessary checkpoint overhead. In case of resource failure, the proposed Fault Index Based Rescheduling (FIBR) algorithm reschedules the job from the failed resource to another available resource with the least Fault-index value and resumes the job from the last saved checkpoint. This ensures that the job is executed within its deadline with increased throughput and helps make the grid environment trustworthy.
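The two mechanisms described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact rules, so the `Resource` fields, the doubling rule in `checkpoint_interval`, and the treatment of the Fault-index as a simple integer are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical record of what the Information server might keep per resource."""
    name: str
    fault_index: int          # assumed: higher value means more past failures
    available: bool
    last_failure_time: float  # time of the most recent failure, in seconds
    mean_failure_time: float  # mean time between failures, in seconds


def checkpoint_interval(res: Resource, now: float, base_interval: float) -> float:
    """Adapt the checkpoint interval from the resource's failure history.

    Assumed rule (not taken from the paper): if less time has passed since
    the last failure than the mean failure time, a failure is not yet
    expected, so checkpoint half as often. Otherwise keep the base interval.
    """
    elapsed = now - res.last_failure_time
    if elapsed < res.mean_failure_time:
        return base_interval * 2
    return base_interval


def reschedule(resources: List[Resource], failed: Resource) -> Optional[Resource]:
    """Pick the available resource with the least Fault-index value.

    Returns None when no other resource is available. The caller would then
    resume the job on the chosen resource from the last saved checkpoint.
    """
    candidates = [r for r in resources if r.available and r.name != failed.name]
    return min(candidates, key=lambda r: r.fault_index, default=None)
```

For example, given three resources where the failed one has Fault-index 3 and the others have 1 (available) and 0 (unavailable), `reschedule` returns the resource with Fault-index 1, since the unavailable one is excluded before the minimum is taken.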