Abstract

Massive storage systems composed of tens of thousands of disks are increasingly common in high-performance computing data centers. With such an enormous number of components integrated within the storage system, the probability of correlated failures across a large number of components becomes a critical concern in preventing data loss. In this paper we reconsider the efficiency of traditional declustered parity data protection schemes in the presence of correlated failures. To better protect against correlated failures, we introduce Single-Overlap Declustered Parity (SODP), a novel declustered parity design that tolerates more disk failures than traditional declustered parity. We then introduce CoFaCTOR, a tool for exploring operational reliability in the presence of many types of correlated failures. By seeding CoFaCTOR with real failure traces from LANL's data center, we are able to create a failure model that accurately describes failures in the existing file system, and we can use that model to generate failure data for hypothetical system designs. Our evaluation using CoFaCTOR traces shows that, compared to the state of the art, our SODP-based placement algorithms can reduce the probability of data loss during failure bursts by 30x and can achieve similar data protection using only half as much parity overhead.
