Abstract

Reliability and cost are two important targets for distributed storage systems. Over the years, numerous schemes have been proposed to improve the reliability or cost of distributed storage systems, and they can be divided into three categories: (1) data redundancy schemes; (2) data placement schemes; and (3) data repair schemes. However, it is still unclear how to build a reliable and cost-efficient distributed storage system, because of (i) insufficient consideration of the combinations of different schemes; and (ii) insufficient consideration of the failures and recoveries of different subsystems (racks, nodes, disks, and sectors). To measure the reliability and cost of different schemes, we design and implement CR-SIM, a Comprehensive Reliability SIMulator for distributed storage systems. It accounts for various influencing factors, such as the system topology, the data redundancy scheme, the data placement scheme, the data repair scheme, and the failure/recovery models of different subsystems. Using CR-SIM, we conduct various simulation-based experiments, and the results reveal several important findings that are helpful for building reliable and cost-efficient distributed storage systems. For public use, we have open-sourced our code at https://github.com/yichuan0707/CR-SIM.
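To make the enumerated factors concrete, the following is a minimal configuration sketch of what such a simulation must pin down. All keys, scheme names, and parameter values are hypothetical illustrations and do not reflect CR-SIM's actual API or defaults.

```python
# Hypothetical configuration sketch (not CR-SIM's actual API): the factors a
# comprehensive reliability simulation must specify, per the abstract.
simulation_config = {
    # System topology: racks of nodes, nodes holding disks.
    "topology": {"racks": 20, "nodes_per_rack": 20, "disks_per_node": 10},
    # Data redundancy scheme, e.g. 3-way replication or an RS(k, m) code.
    "redundancy": {"scheme": "RS", "k": 6, "m": 3},
    # Data placement: how the blocks of a stripe map onto nodes and racks.
    "placement": {"on_nodes": "PSS", "on_racks": "Hier"},
    # Data repair: when and how lost blocks are reconstructed.
    "repair": {"scheme": "Lazy+RAFI", "lazy_threshold": 2},
    # Failure/recovery models per subsystem (rack, node, disk, sector);
    # the distributions and parameters below are placeholders.
    "failure_models": {
        "node": {"dist": "weibull", "shape": 1.1, "scale_hours": 87600},
        "disk": {"dist": "weibull", "shape": 1.2, "scale_hours": 43800},
    },
}
```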

Highlights

  • Today’s distributed storage systems, such as the Google File System (GFS) [1], the Hadoop Distributed File System (HDFS) [2], and the OpenStack Swift object storage (Swift) [3], generally consist of thousands of commodity servers that provide storage services. Reliability is critical for distributed storage systems

  • These schemes can be divided into three categories: (i) data redundancy schemes, such as replication (REP) [6], Reed-Solomon (RS) codes [7], Local Repairable Codes (LRC) [8], and Regenerating Codes [9], [10]; (ii) data placement schemes, such as the spread placement scheme (SSS) [11], the partitioned placement scheme (PSS) [11], and CopySet [12] for data placement on nodes, and flat data placement (Flat) and hierarchical data placement (Hier) [13] for data placement on racks; and (iii) data repair schemes, such as eager repair (Eager), lazy repair (Lazy) [14], risk-aware failure identification repair (RAFI) [15], and the combination of Lazy and RAFI (Lazy+RAFI) [15]; a sketch after this list compares the storage overheads of the redundancy schemes

  • Findings in Swift: all findings about the effects of scheme combinations are established under the HDFS pattern, so we also check the effects of data redundancy schemes and data placement schemes under the Swift pattern
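To make the cost dimension of the redundancy schemes above concrete, the sketch below computes their storage overheads. The specific parameter choices (3-way REP, RS(6, 3), LRC(12, 2, 2)) are illustrative assumptions, not results from the paper.

```python
# Illustrative arithmetic with assumed parameters:
# storage overhead = raw bytes stored per byte of user data.

def rep_overhead(r: int) -> float:
    """r-way replication stores r full copies of every block."""
    return float(r)

def rs_overhead(k: int, m: int) -> float:
    """RS(k, m) stores k data blocks plus m parity blocks per stripe."""
    return (k + m) / k

def lrc_overhead(k: int, l: int, g: int) -> float:
    """LRC(k, l, g): k data blocks, l local parities, g global parities."""
    return (k + l + g) / k

print(rep_overhead(3))         # 3.0   -> 200% extra storage
print(rs_overhead(6, 3))       # 1.5   -> 50% extra, tolerates any 3 failures
print(lrc_overhead(12, 2, 2))  # ~1.33 -> local parities make single-block
                               #          repairs cheaper than in RS codes
```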


Introduction

Today’s distributed storage systems, such as the Google File System (GFS) [1], the Hadoop Distributed File System (HDFS) [2], and the OpenStack Swift object storage (Swift) [3], generally consist of thousands of commodity servers that provide storage services. Reliability is critical for distributed storage systems. Cloud storage services like Windows Azure Storage [4] and Amazon S3 [5] aim to achieve a yearly reliability of 11 9’s, i.e., 99.999999999%. Such high reliability is often guaranteed by massive resource consumption (cost), such as large storage overheads or heavy data-transfer traffic. Nodes in the same rack are connected by a top-of-rack (ToR) switch, and different racks are connected through a network core, an abstraction of the inter-rack network. This system architecture is inherited from previous works [13], [17].
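A minimal sketch of this rack/node/disk hierarchy follows, assuming a simple dataclass model; it is an illustration of the architecture described above, not CR-SIM's internal representation.

```python
# Topology sketch (assumed structure): the rack -> node -> disk hierarchy
# behind the ToR-switch / network-core view described in the introduction.
from dataclasses import dataclass, field

@dataclass
class Disk:
    disk_id: str
    failed: bool = False

@dataclass
class Node:
    node_id: str
    disks: list[Disk] = field(default_factory=list)

@dataclass
class Rack:
    rack_id: str  # all nodes below share one ToR switch
    nodes: list[Node] = field(default_factory=list)

def build_topology(racks: int, nodes_per_rack: int, disks_per_node: int) -> list[Rack]:
    """Build the rack/node/disk tree; the network core connects the racks."""
    return [
        Rack(
            rack_id=f"r{r}",
            nodes=[
                Node(
                    node_id=f"r{r}-n{n}",
                    disks=[Disk(f"r{r}-n{n}-d{d}") for d in range(disks_per_node)],
                )
                for n in range(nodes_per_rack)
            ],
        )
        for r in range(racks)
    ]

topology = build_topology(racks=20, nodes_per_rack=20, disks_per_node=10)
print(len(topology), len(topology[0].nodes), len(topology[0].nodes[0].disks))  # 20 20 10
```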
