Bandwidth-Aware Scheduling Repair Techniques in Erasure-Coded Clusters: Design and Analysis

Hai Zhou, Yuchong Hu, Dan Feng

doi:10.1109/tpds.2022.3153061

Abstract

Erasure codes offer a storage-efficient redundancy mechanism for maintaining data availability guarantees in storage clusters, yet they also incur high network traffic and long recovery times during failure repair. Extensive research has been carried out to reduce the recovery time. However, previous works either target specific erasure code constructions that are not commonly used in today's distributed storage clusters or neglect the heterogeneous bandwidth found in real network environments. Since erasure-coded clusters are typically composed of multiple nodes that have heterogeneous bandwidth and are accessed in parallel, the overall recovery time is mainly restricted by the low-bandwidth links. In this article, we propose SMFRepair, a single-node multi-level forwarding repair technique designed to improve repair performance in heterogeneous networks, based on Reed-Solomon codes for general fault tolerance. SMFRepair carefully selects the helper nodes and uses idle nodes, which have sufficient unused network bandwidth, to bypass low-bandwidth links. It also pipelines the repair links that are optimized by idle nodes. Furthermore, we propose a multi-node scheduling repair technique called MSRepair. MSRepair carefully schedules the multi-node repair links to saturate as much unoccupied bandwidth as possible and transfers data over the highest-bandwidth links available, with the primary objective of minimizing the recovery time. Large-scale simulations and real experiments on Amazon EC2 show that, compared to state-of-the-art repair techniques, SMFRepair can accelerate single-node recovery by up to 47.69%, and MSRepair can reduce the multi-node recovery time by 33.78% to 67.53%.
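The bottleneck argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's SMFRepair algorithm: it only shows why a parallel repair is bounded by its slowest link and how relaying through an idle node can raise that bound. The node names, the bandwidth table, and the helpers `best_path` and `repair_time` are hypothetical, and the model assumes that each relayed transfer is limited by the slower of its two hops and that links do not share capacity.

```python
def best_path(bw, helper, requester, idle_nodes):
    """Return (effective_bandwidth, path) for one helper-to-requester transfer.

    bw maps (source, destination) pairs to a link bandwidth in MB/s.
    """
    best = (bw[(helper, requester)], [helper, requester])
    for idle in idle_nodes:
        # Assumption: a relayed transfer is limited by the slower of its two hops.
        relayed = min(bw[(helper, idle)], bw[(idle, requester)])
        if relayed > best[0]:
            best = (relayed, [helper, idle, requester])
    return best


def repair_time(bw, helpers, requester, idle_nodes, chunk_mb):
    """Time for all helpers to send one chunk in parallel (toy model).

    The transfers run concurrently, so the slowest path sets the finish time.
    """
    slowest = min(best_path(bw, h, requester, idle_nodes)[0] for h in helpers)
    return chunk_mb / slowest


# Hypothetical link bandwidths in MB/s: H1, H2 are helpers, I is idle, R is the requester.
bw = {
    ("H1", "R"): 100.0,
    ("H2", "R"): 10.0,   # the low-bandwidth link that dominates the repair
    ("H1", "I"): 50.0,
    ("H2", "I"): 80.0,
    ("I", "R"): 90.0,
}

direct = repair_time(bw, ["H1", "H2"], "R", [], chunk_mb=64.0)
relayed = repair_time(bw, ["H1", "H2"], "R", ["I"], chunk_mb=64.0)
print(f"direct: {direct:.2f} s, with idle relay: {relayed:.2f} s")
```

In this made-up topology the 10 MB/s link from H2 to R sets the direct repair time, while routing H2's chunk through I lifts the bottleneck to 80 MB/s. A realistic model would also have to account for links shared by several transfers and for contention at the requester's downlink, which this sketch ignores.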

Journal: IEEE Transactions on Parallel and Distributed Systems
Publication Date: Dec 1, 2022
Citations: 12

Similar Papers

Multi-level Forwarding and Scheduling Repair Technique in Heterogeneous Network for Erasure-coded Clusters
Hai Zhou ... Yuchong Hu
09 Aug 2021

Boosting Erasure-Coded Multi-Stripe Repair in Rack Architecture and Heterogeneous Clusters: Design and Analysis
Hai Zhou ... Dan Feng
IEEE Transactions on Parallel and Distributed Systems | VOL. 34
01 Aug 2023

MRMS: A MOEA-based replication management scheme for cloud storage system
Kangxian Huang ... Dagang Li
01 Nov 2015

CRMS: A centralized replication management scheme for cloud storage system
Kangxian Huang ... Yongyue Sun
01 Oct 2014
