Abstract

Erasure coding has recently become an essential technique for ensuring the reliability of modern distributed storage systems (DSSs), where node-level failures are frequent. In particular, locally repairable codes (LRCs) are widely adopted because of their practical advantage of reducing repair latency. However, recent studies show that many system failures also originate from silent disk errors. For conventional LRCs with low error correction capability, the repair process of erasure coding can propagate these silent errors, so the DSS becomes more vulnerable than it would be under node failures alone. Therefore, we propose a mean time to data loss (MTTDL) metric derived from a modified Markov chain model in order to evaluate the effect of silent disk errors. In addition, we propose a new design of binary error-resilient locally repairable codes (ER-LRCs) with high error and erasure correction capability, which achieve a larger bit-wise minimum Hamming distance than existing LRCs. The ER-LRCs can be constructed by modifying the parity check matrices of well-known optimal binary and nonbinary LRCs. Numerical analysis using the proposed Markov model with empirical parameters shows that the proposed ER-LRCs achieve better MTTDL values than the existing LRCs.
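As a rough illustration of how an MTTDL can be read off an absorbing Markov chain, the sketch below computes the expected time to absorption of a small continuous-time chain with a single data-loss state. The states, failure rate `lam`, and repair rate `mu` are illustrative assumptions only and do not reproduce the modified chain proposed in the paper.

```python
import numpy as np

# Minimal sketch: MTTDL as the expected time to absorption of a
# continuous-time Markov chain with one absorbing "data loss" state.
# States: 0 = all nodes healthy, 1 = one node failed (repair running),
#         2 = data loss (absorbing). Rates are placeholder assumptions.
lam = 1.0 / 10_000.0   # assumed per-node failure rate (1/hours)
mu = 1.0 / 24.0        # assumed repair rate (1/hours)

# Generator matrix restricted to the transient states {0, 1}
# for a toy 3-node system that tolerates a single failure.
Q = np.array([
    [-3 * lam,         3 * lam],      # any of 3 nodes fails
    [      mu, -(mu + 2 * lam)],      # repair, or a 2nd failure -> loss
])

# Expected times to absorption t satisfy Q t = -1.
t = np.linalg.solve(Q, -np.ones(2))
print(f"MTTDL from the all-healthy state: {t[0]:.3e} hours")
```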

Highlights

  • Resilience for cluster-level distributed storage systems (DSSs) has been one of the significant issues in the stable operation of cloud data centers and high-performance computing (HPC) centers

  • To improve reliability against disk failures using maximum distance separable (MDS) array erasure codes, SD and maximally recoverable (MR) erasure codes were designed to handle simultaneous failures at two granularities, the node and the disk [2]–[10]

  • We first propose the bit-wise minimum Hamming distance db, a new code parameter closely related to the error correction capability against silent data corruption (SDC); a brute-force sketch of this parameter is given after this list
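The following sketch computes the bit-wise minimum Hamming distance of a small binary linear code by brute force over all nonzero codewords. The generator matrix shown is a hypothetical toy example (a [7,4] Hamming code), not the ER-LRC construction of the paper; for a nonbinary LRC one would first expand each symbol into its binary representation.

```python
import itertools
import numpy as np

# Minimal sketch: bit-wise minimum Hamming distance d_b of a small binary
# linear code, found by enumerating all nonzero codewords.
# Toy [7,4] Hamming code generator matrix (illustrative assumption).
G = np.array([
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
], dtype=np.uint8)

k = G.shape[0]
d_b = min(
    int(((np.array(msg, dtype=np.uint8) @ G) % 2).sum())
    for msg in itertools.product((0, 1), repeat=k)
    if any(msg)
)
print(f"bit-wise minimum Hamming distance d_b = {d_b}")  # 3 for this toy code
```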


Summary

INTRODUCTION

Resilience for cluster-level distributed storage systems (DSSs) has been one of the significant issues in the stable operation of cloud data centers and high-performance computing (HPC) centers. In order to improve the resilience of DSSs against SDCs, modern DSSs periodically check data consistency, a process called disk scrubbing, by verifying the checksums of the RAID or erasure code across all disks (a sketch of such a scrubbing pass is given below); these mechanisms eventually increase the total cost of ownership (TCO) of the service providers. To this end, it is necessary to evaluate reliability while accounting for system failures in large-scale DSSs, yet exact analysis or simulation is hard to carry out at such scale [16].
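A minimal sketch of a scrubbing pass is given below: each stored block's checksum is recomputed and compared against a stored value to detect silent data corruption. The block layout, file naming, and use of SHA-256 are assumptions made for illustration, not the mechanism described in the paper.

```python
import hashlib
from pathlib import Path

def scrub(block_dir: str, stored_checksums: dict[str, str]) -> list[str]:
    """Return names of blocks whose recomputed checksum disagrees with the
    stored one; these are candidates for repair from the erasure code."""
    corrupted = []
    for block in Path(block_dir).glob("*.blk"):   # hypothetical block files
        digest = hashlib.sha256(block.read_bytes()).hexdigest()
        if digest != stored_checksums.get(block.name):
            corrupted.append(block.name)
    return corrupted
```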

PRELIMINARY
SYSTEM MODEL OF ERASURE-CODED DSS WITH THE EXISTING LRC AND PROPOSED ER-LRC
RELIABILITY METRIC
RELIABILITY ANALYSIS OF THE MODIFIED LRCS
ANALYSIS OF MTTDL BEHAVIOR BY PARAMETERS
CONCLUSION