Abstract

To reduce storage cost, distributed storage systems are increasingly using erasure codes to ensure data reliability. Liberation codes, which satisfy the maximum distance separable (MDS) property and provide optimal modification overhead, are a popular class of two-fault-tolerant erasure codes. However, when recovering from a single node failure, erasure codes must read large amounts of data from surviving nodes and transfer it across the network. Existing single-node failure recovery approaches for Liberation codes are either time-consuming or suboptimal. In this article, we first prove the minimum number of symbols required to recover one failed node in a Liberation-coded system. We then derive the conditions that optimal recovery solutions must satisfy. Finally, we propose an algorithm, called Disk Read Optimal Recovery (DROR), which determines an optimal recovery solution in linear time and recovers the failed node while reading the minimum amount of data. We have implemented DROR in Ceph, a real-world storage system, and evaluated it on a cluster of Amazon EC2 instances. We show that DROR reduces reconstruction time by up to 23.6% compared with the recovery approach in Ceph.

Highlights

  • Inexpensive components are preferred for use in modern distributed storage systems due to the economic benefits; these components are less reliable, and data may become temporarily or permanently unavailable

  • We propose a recovery algorithm called Disk Read Optimal Recovery (DROR), which reaches the lower bound on disk reads and, in theory, reduces disk reads by almost 25% compared with the conventional approach

  • We study the problem of minimizing the number of symbols read from surviving nodes when repairing an erased data node in Liberation coded storage systems

Summary

INTRODUCTION

N. Liang et al.: Optimal Recovery Approach for Liberation Codes in Distributed Storage Systems

Inexpensive components are preferred in modern distributed storage systems for their economic benefits, but these components are less reliable, and data may become temporarily or permanently unavailable. With replication, lost data can be recovered by copying any one surviving replica. With erasure codes, by contrast, recovery must read and transfer far more data, and this k-factor increase in both disk I/O and network traffic results in a long recovery time, which may seriously affect system service performance. The consensus for storage systems is that two-failure tolerance is the right level of tolerance, assuming that data stripes are not large. Our work supports this trend: we are concerned with one kind of MDS RAID-6 code, Liberation codes, and investigate their recovery performance in distributed storage systems.
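The contrast between copying one surviving replica and a k-factor increase in reads can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: it uses a single XOR parity block, not a Liberation code, and it does not implement DROR. The function names (`xor_blocks`, `recover_from_replica`, `recover_from_parity`) are hypothetical and chosen for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_from_replica(replicas, failed):
    """Replication: copy any one surviving replica, so one block is read."""
    survivor = next(r for i, r in enumerate(replicas) if i != failed)
    return survivor, 1


def recover_from_parity(data, parity, failed):
    """Toy single-parity code (an assumption, not a Liberation code):
    read the other k - 1 data blocks plus the parity block, k reads in total."""
    survivors = [b for i, b in enumerate(data) if i != failed] + [parity]
    return xor_blocks(survivors), len(survivors)


k = 4
data = [bytes([i] * 8) for i in range(1, k + 1)]  # k data blocks of 8 bytes
parity = xor_blocks(data)

copied, replica_reads = recover_from_replica([data[2]] * 3, failed=0)
rebuilt, parity_reads = recover_from_parity(data, parity, failed=2)

print(f"replication reads {replica_reads} block, parity recovery reads {parity_reads}")
```

In this toy setting the rebuilt block matches the lost one in both cases, but the parity-based path reads k blocks where replication reads one. That gap is what read-optimal recovery schemes aim to narrow.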

BACKGROUND
MATRIX-VECTOR DEFINITION
TWO-DIMENSIONAL ARRAY DESCRIPTION
READ-OPTIMAL RECOVERY SEQUENCES
READ-OPTIMAL RECOVERY ALGORITHM
RESULTS
Findings
CONCLUSION
