A Resource Management System for Fault Tolerance in Grid Computing

Hwamin Lee,Sang-Soo Yeo,Min Hong,Doosoon Park,Sunghoon Kim,Sookyun Kim

doi:10.1109/cse.2009.257

Abstract

In Grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. The failure occurrence of resources in the grid computing is higher than in a tradition parallel computing. Since the failure of resources affects job execution fatally, fault tolerance service is essential in computational grids. And grid services are often expected to meet some minimum levels of quality of service (QoS) for desirable operation. However Globus toolkit does not provide fault tolerance service that supports fault detection service and management service and satisfies QoS requirement. Thus this paper proposes fault tolerance service to satisfy QoS requirement in computational grids. In order to provide fault tolerance service and satisfy QoS requirements, we expand the definition of failure, such as process failure, processor failure, and network failure. And we propose resource scheduling service, fault detection service and fault management service and show implement and experiment results.

Full Text