Abstract

This paper defines fault, error, and failure in a cloud and systematically analyses the principles for achieving high fault tolerance, drawing on fault tolerance theories suited to large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. The work comprises three parts: (1) analysing the mathematical relationship between different failure rates and the checkpointing fault tolerance strategy; (2) building a dynamic adaptive checkpointing fault tolerance model to maximise serviceability and meet service level objectives (SLOs); and (3) evaluating the strategy under various conditions in large-scale cloud data centres, considering system-centric parameters such as fault tolerance degree and fault tolerance overhead. Theoretical and experimental results show that DAFT has high potential, as it provides efficient fault tolerance enhancements.
