Modelling and evaluating a high serviceability fault tolerance strategy in cloud computing environments

Dawei Sun, Xingwei Wang, Guiran Chang, Changsheng Miao

doi:10.1504/ijsn.2012.053458

Abstract

In this paper, fault, error, and failure in a cloud are defined, and the principles for achieving high fault tolerance objectives are systematically analysed with reference to fault tolerance theories suited to large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (1) analysing the mathematical relationship between different failure rates and the checkpointing fault tolerance strategy; (2) building a dynamic adaptive checkpointing fault tolerance model to maximise serviceability and meet the service level objectives (SLOs); and (3) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centres, considering different system-centric parameters such as fault tolerance degree and fault tolerance overhead. Theoretical as well as experimental results demonstrate that DAFT has high potential, as it provides efficient fault tolerance enhancements.
