Reducing repair-bandwidth using codes based on factor graphs

Dongwon Lee, Jaekyun Moon, Hyegyeong Park

doi:10.1109/icc.2016.7510728

Abstract

Distributed storage systems suffer from significant repair traffic caused by frequent storage node failures. This paper shows that properly designed low-density parity-check (LDPC) codes can substantially reduce the number of block downloads required for repair, thanks to the sparse nature of their factor graph representation. In particular, with a careful construction of the factor graph, both low repair bandwidth and high reliability can be achieved for a given code rate. First, a formula for the average repair bandwidth of LDPC codes is developed. This formula is then used to establish that the minimum repair bandwidth is achieved by forcing a regular check node degree in the factor graph. It is also shown that, for a given repair-bandwidth overhead, LDPC codes can have substantially higher reliability than currently utilized Reed-Solomon (RS) codes. Our reliability analysis is based on a formulation of the general equation for the mean-time-to-data-loss (MTTDL) associated with LDPC codes. The formulation reveals that the MTTDL is closely related to the stopping number. For code rates 1/2, 2/3, and 3/4, our results show that quasi-cyclic (QC) progressive-edge-growth (PEG) LDPC codes with variable node degree 2 allow a 25% to 50% reduction in repair bandwidth while maintaining a higher MTTDL than currently employed RS codes.
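
To make the link between check node degree and repair traffic concrete, the following is a minimal sketch, not the formula derived in the paper. It assumes that a single failed block (variable node) is rebuilt from one parity-check equation containing it, which requires reading the other d_c - 1 blocks in that check, and that a (d_v, d_c)-regular code has design rate R = 1 - d_v/d_c. The function names and the toy matrix `H_REGULAR` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def regular_check_degree(dv, rate):
    """Check node degree d_c of a (d_v, d_c)-regular LDPC code.

    Uses the design-rate relation R = 1 - d_v / d_c, so d_c = d_v / (1 - R).
    """
    dc = Fraction(dv) / (1 - Fraction(rate))
    if dc.denominator != 1:
        raise ValueError("d_v / (1 - rate) must be an integer")
    return int(dc)


def average_repair_reads(H):
    """Average number of blocks read to repair one failed block.

    H is a binary parity-check matrix given as a list of rows.
    Assumption (not taken from the paper): each failed block is repaired
    through the lowest-degree check that contains it, reading the other
    d_c - 1 blocks of that check.
    """
    check_degrees = [sum(row) for row in H]
    n = len(H[0])
    total = 0
    for v in range(n):
        costs = [check_degrees[c] - 1 for c, row in enumerate(H) if row[v]]
        if not costs:
            raise ValueError(f"variable node {v} is not covered by any check")
        total += min(costs)
    return Fraction(total, n)


# Toy factor graph: 6 variable nodes of degree 2, 3 check nodes of degree 4.
H_REGULAR = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
]

# Reads per repair for the code rates named in the abstract, with d_v = 2.
for rate in (Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)):
    dc = regular_check_degree(2, rate)
    print(f"rate {rate}: d_c = {dc}, reads per repair = {dc - 1}")

print("toy matrix average reads:", average_repair_reads(H_REGULAR))
```

Under these assumptions the number of blocks read per repair depends only on the check node degree and not on the code length, whereas conventional repair of an (n, k) RS code reads k blocks.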

Full Text