A comprehensive repair scheme for distributed storage systems

Guang Fang, Junmei Chen, Yeqiao Hou, Xianglong Li, Zongpeng Li

doi:10.1016/j.comnet.2023.109954

Abstract

Modern data storage systems apply erasure codes to provide data reliability efficiently. Previous studies proposed a series of techniques to balance repair and storage costs, reduce codec complexity, minimize repair time, improve fault tolerance, and enforce system-level service-level agreements. These techniques were designed in isolation, which limits the performance they can achieve. We explore the potential advantages of combining them to better meet the requirements of data storage systems and provide superior system performance. This work proposes a comprehensive repair scheme for failed data in distributed storage systems. First, we tailor the design of erasure codes to heterogeneous storage devices. The core idea is to monitor device performance (e.g., access speed, reliability), compute two coefficients for each device, and use them to select the appropriate devices to create stripes of erasure codes. Second, we leverage the system hierarchy to perform intermediary repair operations, further minimizing cross-rack repair bandwidth. Finally, we propose a new repair scheme adapted to the skew of data access. To demonstrate the effectiveness of our comprehensive repair scheme, we evaluate various erasure codes via mathematical analysis and experiments on a Ceph cluster. Compared with traditional re-encoding methods and more recent adaptive erasure codes, our scheme achieves significant savings in recovery bandwidth, code-switching bandwidth, repair time, and code-switching time.
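
To make the idea of intermediary repair concrete, the sketch below contrasts a flat repair with a rack-level one. It is only an illustration under simplifying assumptions, not the scheme proposed in this paper: it uses a single XOR parity block rather than a general erasure code, it assumes the repair node sits in a rack that holds no helper blocks (so every transfer to it crosses racks), and the function names `xor_blocks`, `repair_flat`, and `repair_hierarchical` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def repair_flat(racks):
    """Ship every surviving block to the repair node.

    Returns the rebuilt block and the number of cross-rack transfers.
    """
    helpers = [block for rack in racks for block in rack]
    return xor_blocks(helpers), len(helpers)


def repair_hierarchical(racks):
    """Let each rack XOR its local blocks first, then ship one partial result.

    Returns the rebuilt block and the number of cross-rack transfers.
    """
    partials = [xor_blocks(rack) for rack in racks if rack]
    return xor_blocks(partials), len(partials)


# Illustrative stripe: four data blocks plus one XOR parity block.
data = [bytes([i] * 4) for i in range(1, 5)]
parity = xor_blocks(data)

# data[0] is lost; the survivors are spread over two racks.
racks = [[data[1], data[2]], [data[3], parity]]

flat_block, flat_transfers = repair_flat(racks)
hier_block, hier_transfers = repair_hierarchical(racks)
```

Because XOR is associative, both paths rebuild the same block, but the hierarchical path sends one partial result per rack instead of one block per helper. In this toy layout that is two cross-rack transfers instead of four.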

Full Text