SelectiveEC: Towards Balanced Recovery Load on Erasure-Coded Storage Systems

Liangliang Xu, Qiliang Li, Lingjiang Xie, Min Lyu, Cheng Li, Yinlong Xu

doi:10.1109/tpds.2021.3129973

Abstract

Erasure coding (EC) is commonly used to offer high data reliability at low storage cost. Upon failures, the lost blocks are recovered in batches. Because each batch contains only a limited number of stripes, the data layout within a batch is non-uniform. Together with the random selection of source and replacement nodes for recovery tasks, this skews the recovery workload among live nodes within a batch, which severely slows down failure recovery. To solve this problem, we present SelectiveEC, a new recovery task scheduling module that provides provable network traffic and recovery load balancing for large-scale EC-based storage systems. It relies on bipartite graphs to model the recovery traffic among live nodes. It then selects tasks to form batches and determines where to read source blocks and where to store recovered ones, using graph-theoretic tools such as perfect or maximum matchings and k-regular spanning subgraphs. SelectiveEC supports both single-node and multi-node failure recovery, and can be deployed in both homogeneous and heterogeneous network environments. We implement SelectiveEC in HDFS and evaluate its recovery performance in a local cluster of 18 nodes and on AWS EC2 with 50 virtual machine instances. SelectiveEC increases the recovery throughput by up to 30.68% compared with state-of-the-art baselines in homogeneous network environments. In heterogeneous network environments, it further achieves on average 1.32× the recovery throughput and 1.23× the benchmark throughput of HDFS, owing to the straggler avoidance of the balanced scheduling.
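The matching idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not give SelectiveEC's actual algorithm, so everything below is an assumption made for illustration: the function name `max_bipartite_matching`, the stripe and node labels, and the simplification that each recovery task needs exactly one replacement node and each node accepts at most one task per batch. The code uses the standard augmenting-path method for maximum bipartite matching, which is one possible way to obtain such an assignment.

```python
from typing import Dict, List, Set


def max_bipartite_matching(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a maximum matching task -> node via augmenting paths.

    `candidates` maps each (hypothetical) recovery task to the live nodes
    that are allowed to store its recovered block.
    """
    node_to_task: Dict[str, str] = {}

    def try_assign(task: str, visited: Set[str]) -> bool:
        for node in candidates[task]:
            if node in visited:
                continue
            visited.add(node)
            # Take the node if it is free, or if the task currently
            # holding it can be moved to another candidate node.
            if node not in node_to_task or try_assign(node_to_task[node], visited):
                node_to_task[node] = task
                return True
        return False

    for task in candidates:
        try_assign(task, set())

    # Invert the mapping so callers can look up a task's replacement node.
    return {task: node for node, task in node_to_task.items()}


# Illustrative input only: three tasks and their candidate replacement nodes.
candidates = {
    "stripe_a": ["n1", "n2"],
    "stripe_b": ["n1"],
    "stripe_c": ["n2", "n3"],
}
assignment = max_bipartite_matching(candidates)
print(assignment)
```

In this toy input every task ends up on a distinct node, so no node receives more than one recovered block in the batch. A random choice could instead place `stripe_a` and `stripe_b` on `n1`, which is the kind of skew the abstract describes.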


Journal: IEEE Transactions on Parallel and Distributed Systems
Publication Date: Oct 1, 2022
Citations: 6
License type: publisher-specific, author manuscript

Similar Papers

A Load-Aware Multistripe Concurrent Update Scheme in Erasure-Coded Storage System
Junqi Chen ... Ashish Bagwari
Wireless Communications and Mobile Computing | VOL. 2022
19 May 2022

A tale of two erasure codes in HDFS
16 Feb 2015

Adaptive Updates for Erasure-Coded Storage Systems Based on Data Delta and Logging
Bing Wei ... Yujun Liu
01 Jan 2021

Controlling multimedia streams across internet and ATM network
Youngmee Shin ... Sehyeong Cho
01 Jan 1998
