Abstract
Erasure coding techniques have recently been considered essential for ensuring the reliability of modern distributed storage systems (DSSs), in which node-level failures are frequent. In particular, locally repairable codes (LRCs) are widely adopted for their practical advantage of reducing repair latency. However, recent studies show that many system failures also originate from silent disk errors. For conventional LRCs with low error correction capability, the repair process of erasure coding can propagate these silent errors, and the DSSs thus become more vulnerable than in the case of node failures alone. We therefore propose a mean time to data loss (MTTDL) analysis based on a modified Markov chain model in order to evaluate the effects of silent disk errors. In addition, we propose a new design of binary error-resilient locally repairable codes (ER-LRCs) with high error and erasure correction capabilities, which achieve larger bit-wise minimum Hamming distances than existing LRCs. ER-LRCs can be constructed by modifying the parity-check matrices of well-known optimal binary and nonbinary LRCs. Numerical analysis using the proposed Markov model with empirical parameters shows that the proposed ER-LRCs achieve better MTTDL values than existing LRCs.
Highlights
Resilience of cluster-level distributed storage systems (DSSs) has been one of the significant issues in the stable operation of cloud data centers and high-performance computing (HPC) centers
To improve reliability against disk failures using maximum distance separable (MDS) array erasure codes, sector-disk (SD) and maximally recoverable (MR) erasure codes were designed to optimize for simultaneous failures at two granularities, both node and disk [2]–[10]
We first propose the bit-wise minimum Hamming distance db, a new code parameter closely related to the error correction capability against silent data corruption (SDC)
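For a binary linear code, the bit-wise minimum Hamming distance is the smallest Hamming weight over all nonzero codewords. The brute-force sketch below illustrates the definition on a small example; the function name, the enumeration approach, and the [7,4] Hamming code used as input are illustrative assumptions, not the construction proposed in the paper.

```python
import itertools

def min_hamming_distance(G):
    """Minimum Hamming weight over all nonzero codewords of the binary
    linear code generated by G (a list of rows over GF(2)).
    Brute force: enumerates all 2^k - 1 nonzero messages."""
    k = len(G)       # dimension (number of generator rows)
    n = len(G[0])    # code length
    best = n
    for msg in itertools.product([0, 1], repeat=k):
        if not any(msg):
            continue  # skip the all-zero codeword
        # Codeword = msg * G over GF(2) (XOR of the selected rows)
        cw = [0] * n
        for i, m in enumerate(msg):
            if m:
                cw = [c ^ g for c, g in zip(cw, G[i])]
        best = min(best, sum(cw))
    return best

# Example: systematic generator matrix of the [7,4] Hamming code
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
print(min_hamming_distance(G))  # → 3
```

Enumerating all 2^k codewords is only feasible for short codes, but it makes the parameter being optimized by the ER-LRC designs concrete: a larger db means more bit errors can be corrected or detected.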
Summary
Resilience of cluster-level distributed storage systems (DSSs) has been one of the significant issues in the stable operation of cloud data centers and high-performance computing (HPC) centers. To improve the resilience of DSSs against SDCs, modern DSSs periodically check data consistency, a method called disk scrubbing, by verifying the checksums of the RAID or the erasure code across all disks, which eventually increases the total cost of ownership (TCO) for service providers. To this end, evaluating reliability while accounting for system failures in large-scale DSSs is necessary, but exact analysis or simulation is hard to carry out [16].
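The MTTDL of a Markov failure model is the expected time to reach the absorbing data-loss state. As a minimal sketch, assuming a generic three-state chain (all disks good, one disk failed, data loss) rather than the paper's modified model, the MTTDL from the all-good state can be obtained by solving t = -Q_T^{-1} 1 for the transient part Q_T of the generator matrix; the parameter names and the toy failure/repair rates below are assumptions for illustration.

```python
def mttdl(n, lam, mu):
    """Expected time to data loss for a toy 3-state Markov model of an
    n-disk group: per-disk failure rate lam, repair rate mu.
    States: 0 = all good, 1 = one failed, 2 = data loss (absorbing).
    Solves t = -Q_T^{-1} 1 for the 2x2 transient generator Q_T."""
    # Transient generator Q_T over states {0, 1}
    a, b = -n * lam, n * lam
    c, d = mu, -(mu + (n - 1) * lam)
    det = a * d - b * c
    # First component of t = -Q_T^{-1} . [1, 1]^T
    return -(d - b) / det

# Illustrative rates: 10 disks, MTTF 1e5 hours, 24-hour repair
print(mttdl(10, 1e-5, 1 / 24))
```

For this chain the result matches the classical closed form (mu + (2n-1)*lam) / (n*(n-1)*lam^2); the paper's modified chain adds states and transitions for silent disk errors, which this generic sketch does not model.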