Efficient checkpointing mechanisms for primary‐backup replication on the cloud

Berkin Güler, Öznur Özkasap

doi:10.1002/cpe.4707

Abstract

Several distributed services, ranging from key‐value stores to cloud storage, require fault‐tolerance and reliability features. To enable fast recovery and seamless transition, primary‐backup replication protocols are widely used in application settings including distributed databases, web services, and the Internet of Things. In this study, we elaborate on ways of enhancing the efficiency of the primary‐backup replication protocol by introducing various checkpointing techniques. We develop a geographically replicated key‐value store based on RocksDB and use the PlanetLab testbed network for large‐scale performance analysis. We evaluate blocking time, checkpointing time, checkpoint size, failover time, and throughput, testing with practical workloads via the YCSB tool. Our findings indicate that periodic‐incremental checkpointing promises up to a fivefold decrease in blocking time and a drastic improvement in overall throughput compared to traditional primary‐backup replication. Furthermore, enabling the Snappy compression algorithm on periodic‐incremental checkpointing leads to a further reduction in blocking time and increases system throughput compared to traditional primary‐backup replication.
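The idea behind periodic‐incremental checkpointing can be illustrated with a minimal sketch. It is not the implementation evaluated in the study: the store is a plain in‐memory dictionary rather than RocksDB, the checkpoint interval is counted in write operations (an assumption), and Python's standard `zlib` stands in for Snappy, which is not part of the standard library. The names `Primary`, `Backup`, and `run` are hypothetical.

```python
import json
import zlib


class Primary:
    """Toy in-memory primary that tracks changes since the last checkpoint."""

    def __init__(self):
        self.store = {}
        self.dirty = set()    # keys written since the last checkpoint
        self.deleted = set()  # keys deleted since the last checkpoint

    def put(self, key, value):
        self.store[key] = value
        self.dirty.add(key)
        self.deleted.discard(key)

    def delete(self, key):
        self.store.pop(key, None)
        self.dirty.discard(key)
        self.deleted.add(key)

    def incremental_checkpoint(self, compress=False):
        # Serialize only the delta, not the whole store.
        delta = {
            "puts": {k: self.store[k] for k in sorted(self.dirty)},
            "deletes": sorted(self.deleted),
        }
        self.dirty.clear()
        self.deleted.clear()
        payload = json.dumps(delta).encode("utf-8")
        # zlib is a stand-in for Snappy (assumption for this sketch).
        return zlib.compress(payload) if compress else payload


class Backup:
    """Toy backup that applies incremental checkpoints in order."""

    def __init__(self):
        self.store = {}

    def apply(self, payload, compressed=False):
        raw = zlib.decompress(payload) if compressed else payload
        delta = json.loads(raw.decode("utf-8"))
        self.store.update(delta["puts"])
        for key in delta["deletes"]:
            self.store.pop(key, None)


def run(ops, interval, compress=False):
    """Apply (key, value) writes, shipping a checkpoint every `interval` writes."""
    primary, backup = Primary(), Backup()
    shipped = 0
    for i, (key, value) in enumerate(ops, start=1):
        primary.put(key, value)
        if i % interval == 0:
            backup.apply(primary.incremental_checkpoint(compress), compress)
            shipped += 1
    return primary, backup, shipped
```

In this sketch the primary is not blocked on every write. It ships only the keys changed since the previous checkpoint, so a failure between checkpoints can lose at most the writes of one interval on the backup side.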

Full Text