A Report on Simulation-Driven Reliability and Failure Analysis of Large-Scale Storage Systems

Lipeng Wan,Qing Cao,H Sarp Oral,Sudharshan S Vazhkudai,Feiyi Wang

doi:10.2172/1185665

Abstract

High-performance computing (HPC) storage systems provide data availability and reliability using various hardware and software fault tolerance techniques. Usually, reliability and availability are calculated at the subsystem or component level using limited metrics such as, mean time to failure (MTTF) or mean time to data loss (MTTDL). This often means settling on simple and disconnected failure models (such as exponential failure rate) to achieve tractable and close-formed solutions. However, such models have been shown to be insufficient in assessing end-to-end storage system reliability and availability. We propose a generic simulation framework aimed at analyzing the reliability and availability of storage systems at scale, and investigating what-if scenarios. The framework is designed for an end-to-end storage system, accommodating the various components and subsystems, their interconnections, failure patterns and propagation, and performs dependency analysis to capture a wide-range of failure cases. We evaluate the framework against a large-scale storage system that is in production and analyze its failure projections toward and beyond the end of lifecycle. We also examine the potential operational impact by studying how different types of components affect the overall system reliability and availability, and present the preliminary results

Full Text