Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks

Hai Zhou, Dan Feng

doi:10.1145/3664926

Abstract

A growing number of storage systems use erasure coding to tolerate faults. An erasure code takes a set of data blocks as input and encodes them into a small number of parity blocks; together, these blocks form a stripe. When recovery is considered at the multi-stripe level in clusters with heterogeneous networks, quickly generating an efficient multi-stripe recovery solution that reduces recovery time remains challenging. Previous works use either a greedy algorithm, which may become trapped in a local optimum and yield low recovery performance, or a meta-heuristic algorithm, which has a long running time and generates solutions inefficiently. In this paper, we propose Stripe-schedule Aware Repair (SARepair), a technique for multi-stripe recovery in heterogeneous erasure-coded clusters based on Reed-Solomon (RS) codes. By carefully examining block metadata, SARepair adjusts the recovery solution for each stripe and obtains another multi-stripe solution with less recovery time in a computationally efficient manner. It then tolerates worse solutions to escape local optima and uses a rollback mechanism to adjust the search region, further reducing recovery time. Moreover, instead of reading blocks sequentially from each node, SARepair selectively schedules the reading order of each block to reduce memory overhead. We extend SARepair to handle full-node recovery and to adapt to LRC codes. We prototype SARepair and show via both simulations and Amazon EC2 experiments that recovery performance can be improved by up to 59.97% over a state-of-the-art recovery approach while keeping running time and memory overhead low.
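
To make the search idea concrete, the following is a minimal toy sketch, not the SARepair algorithm itself, whose details the abstract does not give. It assumes a simplified cost model in which each stripe reads k helper blocks from its surviving nodes and the recovery time is set by the most heavily loaded node relative to its bandwidth. The names `recovery_time` and `search`, and the parameters `tolerance` and `patience`, are hypothetical. The search accepts slightly worse neighbours to leave a local optimum and rolls back to the best solution seen when progress stalls.

```python
import random

def recovery_time(solution, bandwidth, block_size=1.0):
    """Toy cost (an assumption): the slowest node bounds the whole repair."""
    load = {}
    for helpers in solution:
        for node in helpers:
            load[node] = load.get(node, 0) + 1
    return max(load[n] * block_size / bandwidth[n] for n in load)

def search(stripes, k, bandwidth, steps=200, tolerance=0.2, patience=20, seed=0):
    """Local search that tolerates worse solutions and rolls back when stale."""
    rng = random.Random(seed)
    # Naive start: the first k surviving nodes of each stripe act as helpers.
    current = [list(nodes[:k]) for nodes in stripes]
    cur_t = recovery_time(current, bandwidth)
    best, best_t = [h[:] for h in current], cur_t
    stale = 0
    for _ in range(steps):
        i = rng.randrange(len(stripes))
        unused = [n for n in stripes[i] if n not in current[i]]
        if not unused:
            continue
        # Neighbour: swap one helper of stripe i for an unused surviving node.
        candidate = [h[:] for h in current]
        candidate[i][rng.randrange(k)] = rng.choice(unused)
        t = recovery_time(candidate, bandwidth)
        # Accept the move even if it is worse, within the tolerance band.
        if t <= cur_t * (1 + tolerance):
            current, cur_t = candidate, t
        if cur_t < best_t:
            best, best_t = [h[:] for h in current], cur_t
            stale = 0
        else:
            stale += 1
        # Rollback: return to the best solution after too many stale steps.
        if stale >= patience:
            current, cur_t = [h[:] for h in best], best_t
            stale = 0
    return best, best_t

# Hypothetical cluster: node id -> bandwidth (blocks per time unit).
bandwidth = {0: 1.0, 1: 1.0, 2: 4.0, 3: 4.0, 4: 2.0}
# Surviving nodes that hold a block of each stripe (illustrative values).
stripes = [[0, 1, 2, 3, 4], [0, 1, 3, 4, 2], [1, 0, 2, 4, 3]]
best, best_t = search(stripes, k=3, bandwidth=bandwidth)
print(best, best_t)
```

In this toy model, the naive starting solution concentrates reads on the two slow nodes, and the search can only return a solution that is at least as good as that start. A real implementation would also need to account for decoding cost, link contention, and the block read order that the abstract mentions.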

Full Text