A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

Abstract

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address this challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematic examination of the PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, the subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze the root causes of the observed abnormal symptoms in depth, and this analysis has led to a new patch set that will be merged into the upcoming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.
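
To make the fault-injection idea above concrete, the following is a minimal sketch (not PFault itself) of how a black-box injector might emulate two pre-defined fault models, whole-device failure and network partition, on individual storage nodes. The node names, device paths, and port number are illustrative assumptions; a real deployment would target the actual OST/MDT devices of the Lustre or BeeGFS cluster under test.

```python
#!/usr/bin/env python3
"""Minimal black-box fault-injection sketch (illustrative, not the PFault tool)."""
import subprocess

# Hypothetical storage nodes of a small test cluster and their backing devices.
STORAGE_NODES = {
    "oss1": "/dev/sdb",   # object storage server
    "oss2": "/dev/sdb",
    "mds1": "/dev/sdc",   # metadata server
}

def inject_whole_device_failure(node: str, device: str) -> None:
    """Emulate an unrecoverable device loss by zeroing part of the backing device."""
    cmd = f"dd if=/dev/zero of={device} bs=1M count=64 conv=fsync"
    subprocess.run(["ssh", node, cmd], check=True)

def inject_network_partition(node: str, port: int = 988) -> None:
    """Emulate a node becoming unreachable by dropping its PFS traffic.

    Port 988 is the default Lustre LNET TCP port; adjust for other PFSs.
    """
    cmd = f"iptables -A INPUT -p tcp --dport {port} -j DROP"
    subprocess.run(["ssh", node, cmd], check=True)

if __name__ == "__main__":
    # Apply one fault model to a single node, then run workloads and the PFS
    # checker (e.g., LFSCK or BeeGFS-FSCK) to observe recovery and logging.
    inject_whole_device_failure("oss1", STORAGE_NODES["oss1"])
```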

Similar Papers
  • Conference Article
  • Citations: 12
  • 10.1109/pdsw51947.2020.00013
Fingerprinting the Checker Policies of Parallel File Systems
  • Nov 1, 2020
  • Runzhou Han + 2 more

Parallel file systems (PFSes) play an essential role in high performance computing. To ensure integrity, many PFSes are designed with a checker component, which serves as the last line of defense to bring a corrupted PFS back to a healthy state. Motivated by real-world incidents of PFS corruption, we perform a fine-grained study of the capability of PFS checkers in this paper. We apply type-aware fault injection to specific PFS structures and examine the detection and repair policies of PFS checkers meticulously via a well-defined taxonomy. The study results on two representative PFS checkers show that they are able to handle a wide range of corruptions of important data structures. On the other hand, neither of them is perfect: there are multiple cases where the checkers may behave sub-optimally, leading to kernel panics, wrong repairs, etc. Our work has led to a new patch on Lustre. We hope to develop our methodology into a generic framework for analyzing the checkers of diverse PFSes and enable more elegant designs of PFS checkers for reliable high-performance computing.
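
As a rough illustration of type-aware fault injection, the sketch below corrupts exactly one named field of a serialized metadata record instead of flipping random disk blocks, which is what allows a study to attribute checker behavior to specific structures. The record layout, field offsets, and image path are hypothetical; the actual study targets Lustre and BeeGFS on-disk formats.

```python
# Type-aware corruption sketch: flip the bits of one field of one record.
# The layout below is hypothetical:
#   magic (4 B) | inode (8 B) | size (8 B) | stripe_count (4 B)
FIELD_OFFSETS = {
    "magic": (0, 4),
    "inode": (4, 8),
    "size": (12, 8),
    "stripe_count": (20, 4),
}

def corrupt_field(image_path: str, record_offset: int, field: str) -> None:
    """Overwrite exactly one field of one record with its bitwise complement."""
    off, length = FIELD_OFFSETS[field]
    with open(image_path, "r+b") as f:
        f.seek(record_offset + off)
        original = f.read(length)
        f.seek(record_offset + off)
        f.write(bytes(b ^ 0xFF for b in original))

if __name__ == "__main__":
    # Corrupt the stripe_count of the first record of a test image, then run
    # the PFS checker and classify its detection/repair result via the taxonomy.
    corrupt_field("metadata.img", record_offset=0, field="stripe_count")
```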

  • Conference Article
  • Citations: 33
  • 10.1145/1088149.1088192
High performance support of parallel virtual file system (PVFS2) over Quadrics
  • Jun 20, 2005
  • Weikuan Yu + 2 more

Parallel I/O needs to keep pace with the demand of high performance computing applications on systems with ever-increasing speed. Exploiting high-end interconnect technologies to reduce the network access cost and scale the aggregated bandwidth is one way to increase the performance of storage systems. In this paper, we explore the challenges of supporting a parallel file system with modern Quadrics features, including user-level communication and RDMA operations. We design and implement a Quadrics-capable version of a parallel file system (PVFS2). Our design overcomes the challenges that Quadrics' static communication model imposes on dynamic client/server architectures. Quadrics QDMA and RDMA mechanisms are integrated and optimized for high performance data communication. Zero-copy PVFS2 list IO is achieved with a Single Event Associated MUltiple RDMA (SEAMUR) mechanism. Experimental results indicate that the performance of PVFS2, with Quadrics user-level protocols and RDMA operations, is significantly improved in terms of both data transfer and management operations. With four IO server nodes, our implementation improves PVFS2 aggregated read bandwidth by up to 140% compared to PVFS2 over TCP on top of the Quadrics IP implementation. Moreover, it delivers significant performance improvement to application benchmarks such as mpi-tile-io [24] and BTIO [26]. To the best of our knowledge, this is the first work in the literature to report the design of a high performance parallel file system over Quadrics user-level communication protocols.

  • Conference Article
  • Citations: 10
  • 10.1109/empdp.2003.1183570
A parallel and fault tolerant file system based on NFS servers
  • Jan 1, 2003
  • F Garcia + 4 more

One important piece of system software for clusters is the parallel file system. Current parallel file systems and parallel I/O libraries for clusters do not use standard servers, which makes it very difficult to use these systems in heterogeneous environments. However, why use proprietary or special-purpose servers on the server end of a parallel file system when most of the necessary functionality is already available in NFS servers? This paper describes the fault tolerance implemented in Expand (Expandable Parallel File System), a parallel file system based on NFS servers. Expand allows the transparent use of multiple NFS servers as a single file system, providing a single name space. The different NFS servers are combined to create a distributed partition where files are striped. Expand requires no changes to the NFS server and uses RPC operations to provide parallel access to the same file. Expand is also independent of the clients, because all operations are implemented using RPC and the NFS protocol. Using this system, we can join heterogeneous servers (Linux, Solaris, Windows 2000, etc.) to provide a parallel and distributed partition. Fault tolerance is achieved using RAID techniques applied to parallel files. The paper describes the design of Expand and the evaluation of a prototype using the MPI-IO interface. This evaluation was performed on Linux clusters and compares Expand with PVFS.
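
The RAID-style fault tolerance over striped files can be pictured with a small sketch: each NFS mount point is stood in for by a directory, stripes of a file are written round-robin across the servers, and an XOR parity stripe (RAID-4 style) allows any one lost stripe to be rebuilt. The mount points, stripe size, and parity placement are assumptions for illustration, not Expand's actual layout.

```python
import os
from functools import reduce

# Hypothetical NFS mount points forming the distributed partition, plus a
# dedicated parity server (RAID-4 style; Expand's real layout may differ).
SERVERS = ["/mnt/nfs0", "/mnt/nfs1", "/mnt/nfs2"]
PARITY = "/mnt/nfs_parity"
STRIPE = 64 * 1024  # stripe unit in bytes

def xor_blocks(blocks):
    """XOR a list of equally sized byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def write_striped(name: str, data: bytes) -> None:
    """Stripe `data` round-robin across the servers and store XOR parity."""
    for base in range(0, len(data), STRIPE * len(SERVERS)):
        blocks = []
        for j, srv in enumerate(SERVERS):
            chunk = data[base + j * STRIPE : base + (j + 1) * STRIPE]
            block = chunk.ljust(STRIPE, b"\0")
            blocks.append(block)
            with open(os.path.join(srv, name), "ab") as f:
                f.write(block)
        with open(os.path.join(PARITY, name), "ab") as f:
            f.write(xor_blocks(blocks))

def recover_block(name: str, failed_idx: int, stripe_no: int) -> bytes:
    """Rebuild one stripe unit of a failed server from the survivors and parity."""
    survivors = [s for i, s in enumerate(SERVERS) if i != failed_idx] + [PARITY]
    blocks = []
    for srv in survivors:
        with open(os.path.join(srv, name), "rb") as f:
            f.seek(stripe_no * STRIPE)
            blocks.append(f.read(STRIPE))
    return xor_blocks(blocks)
```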

  • Research Article
  • Citations: 5
  • 10.5555/3019046.3019055
A generic framework for testing parallel file systems
  • Nov 13, 2016
  • Jinrui Cao + 4 more

Large-scale parallel file systems are of prime importance today. However, despite their importance, their failure-recovery capability is much less studied compared with local storage systems. Recent studies on local storage systems have exposed various vulnerabilities that could lead to data loss under failure events, which raises concerns for parallel file systems built on top of them. This paper proposes a generic framework for testing the failure handling of large-scale parallel file systems. The framework captures all disk I/O commands on all storage nodes of the target system to emulate realistic failure states, and checks whether the target system can recover to a consistent state without incurring data loss. We have built a prototype for the Lustre file system. Our preliminary results show that the framework is able to uncover the internal I/O behavior of Lustre under different workloads and failure conditions, which provides a solid foundation for further analyzing the failure recovery of parallel file systems.

  • Research Article
  • Citations: 14
  • 10.1088/1742-6596/180/1/012050
Building a parallel file system simulator
  • Jul 1, 2009
  • Journal of Physics: Conference Series
  • E Molina-Estolano + 3 more

Parallel file systems are gaining in popularity in high-end computing centers as well as commercial data centers. High-end computing systems are expected to scale exponentially and to pose new challenges to their storage scalability in terms of cost and power. To address these challenges, scientists and file system designers will need a thorough understanding of the design space of parallel file systems. Yet there exist few systematic studies of parallel file system behavior at petabyte and exabyte scale. An important reason is the significant cost of getting access to large-scale hardware to test parallel file systems. To contribute to this understanding, we are building a parallel file system simulator that can simulate parallel file systems at very large scale. Our goal is to simulate petabyte-scale parallel file systems on a small cluster or even a single machine in reasonable time and with reasonable fidelity. With this simulator, file system experts will be able to tune existing file systems for specific workloads, scientists and file system deployment engineers will be able to better communicate workload requirements, file system designers and researchers will be able to try out design alternatives and innovations at scale, and instructors will be able to study very large-scale parallel file system behavior in the classroom. In this paper, we describe our approach and provide preliminary results that are encouraging in terms of both fidelity and simulation scalability.

  • Conference Article
  • Citations: 1
  • 10.1109/hpcc/smartcity/dss.2019.00028
A Case Study on the Efficiency of User-Level Parallel File Systems
  • Aug 1, 2019
  • Chen Chen + 5 more

Improving I/O performance of large-scale computing systems has become increasingly dependent on the efficiency of parallel file systems. A common way to deploy parallel file systems is using the user space file system framework (e.g. FUSE), which introduces an extra I/O interposition layer that may cause considerable overhead due to excessive utilization of kernel crossings and system function calls. These slowdowns have been exacerbated due to the growing isolation of user and kernel space, increasing kernel crossing overhead significantly. In this paper, we present our findings on the evaluation of how FUSE affects the efficiency of a popular parallel file system, the Parallel Log-structured File System (PLFS). We then suggest a means to mitigate this issue by removing FUSE from the equation and demonstrate its viability with a proof of concept library.

  • Conference Article
  • Citations: 4
  • 10.1109/pdsw.2010.5668094
Virtualization-based bandwidth management for parallel storage systems
  • Nov 1, 2010
  • Yiqi Xu + 5 more

This paper presents a new parallel storage management approach that supports the allocation of shared storage bandwidth on a per-application basis. Existing parallel storage systems are unable to differentiate I/Os from different applications and meet per-application bandwidth requirements. This limitation presents a hurdle for applications to achieve their desired performance, which will become even more challenging as high-performance computing (HPC) systems continue to scale up with respect to both the amount of available resources and the number of concurrent applications. This paper proposes a novel solution to address this challenge through the virtualization of parallel file systems (PFSes). Such PFS virtualization is achieved with user-level PFS proxies, which interpose between native PFS clients and servers and schedule the I/Os from different applications according to a resource-sharing algorithm (e.g., SFQ(D)). In this way, virtual PFSes can be created on a per-application basis, each with a specific bandwidth share allocated according to its I/O requirement. This approach is applicable to different PFS-based parallel storage systems and can be transparently integrated with existing as well as future HPC systems. A prototype of this approach is implemented upon PVFS2, a widely used PFS, and evaluated with experiments using a typical parallel I/O benchmark (IOR). Results show that the approach's overhead is very small and that it achieves effective proportional sharing under different usage scenarios.
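
The SFQ(D) scheduling mentioned above can be sketched compactly: each request receives a start tag based on its flow's weight, requests are dispatched in start-tag order, and at most D requests are outstanding at the storage servers. The class below is an illustrative reading of start-time fair queuing with depth, not the paper's proxy code; names and structure are assumptions.

```python
import heapq
from collections import defaultdict

class SFQD:
    """Minimal Start-time Fair Queuing with depth D (SFQ(D)) sketch."""

    def __init__(self, weights: dict, depth: int = 4):
        self.weights = weights          # e.g. {"appA": 3, "appB": 1}
        self.depth = depth              # max outstanding requests (the D)
        self.vtime = 0.0                # global virtual time
        self.last_finish = defaultdict(float)
        self.queue = []                 # heap of (start_tag, seq, flow, cost)
        self.outstanding = 0
        self.seq = 0                    # tie-breaker for equal start tags

    def submit(self, flow: str, cost: float) -> None:
        """Tag an incoming request: start = max(vtime, flow's last finish)."""
        start = max(self.vtime, self.last_finish[flow])
        self.last_finish[flow] = start + cost / self.weights[flow]
        heapq.heappush(self.queue, (start, self.seq, flow, cost))
        self.seq += 1

    def dispatch(self):
        """Issue the request with the smallest start tag, or None at the depth limit."""
        if self.outstanding >= self.depth or not self.queue:
            return None
        start, _, flow, cost = heapq.heappop(self.queue)
        self.vtime = start              # advance virtual time to the dispatched tag
        self.outstanding += 1
        return flow, cost

    def complete(self) -> None:
        """Mark one outstanding request as finished by the servers."""
        self.outstanding -= 1
```

For example, with weights {"appA": 3, "appB": 1} and equal-cost requests from both applications, dispatches interleave roughly 3:1 in appA's favor, which is the proportional-sharing behavior the proxies aim for.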

  • Research Article
  • Citations: 8
  • 10.1016/j.proenv.2011.12.050
Performance Evaluation of A Infiniband-based Lustre Parallel File System
  • Jan 1, 2011
  • Procedia Environmental Sciences
  • Yuan Wang + 4 more

  • Research Article
  • Citations: 5
  • 10.1177/1094342016677084
Rethinking key–value store for parallel I/O optimization
  • Dec 23, 2016
  • The International Journal of High Performance Computing Applications
  • Anthony Kougkas + 4 more

Key–value stores are being widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architecture differences and performance characteristics of parallel file systems and key–value stores. We propose using key–value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as the cases with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of these two systems using collected data from our experiments, and we provide a predictive method to identify which system offers better I/O performance given a specific workload. The results show that we can optimize the I/O performance in HPC systems by utilizing key–value stores.

  • Conference Article
  • Citations: 1
  • 10.1109/cluster48925.2021.00099
A Scalability Study of Data Exchange in HPC Multi-component Workflows
  • Sep 1, 2021
  • Jie Yin + 3 more

Multi-component workflows play a significant role in High-Performance Computing and Big Data applications. They usually contain multiple, independently developed components that execute side-by-side to perform sophisticated computation and exchange data through file I/O over a parallel file system. However, file I/O can become an impediment in such systems and cause undesirable performance degradation due to its relatively low speed (compared to the interconnect fabric), which is unacceptable, especially for applications with strict time constraints. The Data Transfer Framework (DTF) is an I/O arbitration layer working with the PnetCDF I/O library that aims to eliminate this bottleneck by transparently redirecting file I/O operations destined for the parallel file system to message passing via the high-speed interconnect between coupled components. Scalable and high-speed data transfer between components can thus be achieved with minimal development effort by using DTF. However, previous work provides insufficient scalability evaluation of the framework. In order to comprehensively evaluate the scalability of an I/O middleware like DTF and highlight its major advantages, we develop an I/O benchmark for multi-component workflows. Using the benchmark, we conduct large-scale scalability evaluations using up to 32,768 compute nodes on the supercomputer Fugaku and 2,048 compute nodes on Oakforest-PACS, comparing direct data transfer to file I/O performed on the Lustre file system and Fugaku's Lightweight Layered IO-Accelerator (LLIO). We provide insights into DTF's scalability and performance enhancements with the intention to inform future I/O middleware and inter-component data exchange design in multi-component workflows.

  • Conference Article
  • Citations: 45
  • 10.1109/ipdps.2015.83
High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA
  • May 1, 2015
  • Md Wasi-Ur-Rahman + 4 more

The viability and benefits of running MapReduce over modern High Performance Computing (HPC) clusters, with high-performance interconnects and parallel file systems, have attracted much attention recently due to the unique opportunity of solving data analytics problems with a combination of Big Data and HPC technologies. Most HPC clusters follow the traditional Beowulf architecture with a separate parallel storage system (e.g., Lustre) and either no, or very limited, local storage. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage system in HPC clusters poses many new opportunities and challenges. In this paper, we propose a novel high-performance design for running YARN MapReduce on such HPC clusters by utilizing Lustre as the storage provider for intermediate data. We identify two different shuffle strategies, RDMA and Lustre Read, for this architecture and provide modules to dynamically detect the best strategy for a given scenario. Our results indicate that, due to the performance characteristics of the underlying Lustre setup, one shuffle strategy may outperform another in different HPC environments, and our dynamic detection mechanism can deliver the best performance based on the performance characteristics observed during job execution. Through this design, we achieve a 44% performance benefit for shuffle-intensive workloads on leadership-class HPC systems. To the best of our knowledge, this is the first attempt to exploit the performance characteristics of alternative shuffle strategies for YARN MapReduce with Lustre and RDMA.

  • Conference Article
  • Citations: 23
  • 10.1109/msst.2012.6232370
VPFS: Bandwidth virtualization of parallel storage systems
  • Apr 1, 2012
  • Yiqi Xu + 5 more

Existing parallel file systems are unable to differentiate I/O requests from concurrent applications and meet per-application bandwidth requirements. This limitation prevents applications from meeting their desired Quality of Service (QoS) as high-performance computing (HPC) systems continue to scale up. This paper presents vPFS, a new solution to address this challenge through a bandwidth virtualization layer for parallel file systems. vPFS employs user-level parallel file system proxies to interpose requests between native clients and servers and to schedule parallel I/Os from different applications based on configurable bandwidth management policies. vPFS is designed to be generic enough to support various scheduling algorithms and parallel file systems. Its utility and performance are studied with a prototype which virtualizes PVFS2, a widely used parallel file system. Enhanced proportional sharing schedulers are enabled based on the unique characteristics (parallel striped I/Os) and requirement (high throughput) of parallel storage systems. The enhancements include new threshold- and layout-driven scheduling synchronization schemes which reduce global communication overhead while delivering total-service fairness. An experimental evaluation using typical HPC benchmarks (IOR, NPB BTIO) shows that the throughput overhead of vPFS is small (<3% for writes, <1% for reads). It also shows that vPFS can achieve good proportional bandwidth sharing (>96% of the target sharing ratio) for competing applications with diverse I/O patterns.

  • Conference Article
  • Citations: 4
  • 10.1109/discs.2014.11
Rethinking Key-Value Store for Parallel I/O Optimization
  • Nov 1, 2014
  • Yanlong Yin + 7 more

Key-Value Stores (KVStores) are being widely used as the storage system for large-scale Internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems (PFSes) are the dominant storage systems. In this study, we carefully examine the architectural differences and performance characteristics of PFSes and KVStores. We propose that it is valuable to utilize KVStores to optimize the overall I/O performance, especially for workloads that PFSes cannot handle well, such as cases with intensive data synchronization or heavy metadata operations. To verify this proposal, we conducted comprehensive experiments with several synthetic benchmarks, an I/O benchmark, and a real application. The results show that our proposal is promising.

  • Conference Article
  • Citations: 3
  • 10.1109/pdcat46702.2019.00021
I/O Scheduling for Limited-Size Burst-Buffers Deployed High Performance Computing
  • Dec 1, 2019
  • Benbo Zha + 1 more

A burst buffer is a high-throughput, small-capacity intermediate storage system integrated between compute nodes and the permanent storage system to mitigate the I/O bottleneck problem in modern High Performance Computing (HPC) platforms. This system, however, is unable to effectively handle variable-intensity I/O bursts that result from unpredictable concurrent accesses to the shared Parallel File System (PFS). In this paper, we introduce a probabilistic I/O scheduling method that takes into account the burst-buffer load state and the instantaneous I/O load distribution of the system, based on a probabilistic model of the applications, to relieve I/O congestion when dynamic application interference pushes the I/O load beyond the PFS bandwidth. For HPC platforms with limited-size burst buffers, the proposed scheduling method makes online, probabilistic decisions about whether each concurrent I/O request should go through to the PFS, be buffered in the burst buffer, or be declined, according to both the available I/O bandwidth and the current buffer state, in order to maximize system efficiency or minimize application dilation. Extensive experiments on synthetic data with realistic characteristics show that our method handles I/O congestion effectively.
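
A bare-bones reading of the probabilistic selection could look like the function below: a request goes straight to the PFS if bandwidth allows, is buffered with a probability that grows with PFS overload and with remaining burst-buffer space, and is otherwise declined. The specific probability formula and parameter names are assumptions for illustration; the paper derives its decisions from a probabilistic model of application interference.

```python
import random

def schedule_request(size_mb: float,
                     pfs_free_bw_mb: float,
                     bb_free_mb: float,
                     bb_capacity_mb: float) -> str:
    """Probabilistically route one I/O burst: 'pfs', 'buffer', or 'decline'."""
    if size_mb <= pfs_free_bw_mb:
        return "pfs"                       # enough PFS bandwidth: let it go through
    # How badly the request exceeds the currently available PFS bandwidth.
    overload = 1.0 - pfs_free_bw_mb / size_mb
    # How much room is left in the burst buffer, as a fraction of capacity.
    buffer_room = bb_free_mb / bb_capacity_mb
    # Buffer with a probability that rises with overload and remaining space.
    if size_mb <= bb_free_mb and random.random() < overload * buffer_room:
        return "buffer"
    return "decline"                       # delay/decline to limit application dilation
```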

  • Conference Article
  • Citations: 1
  • 10.1109/hpcc/smartcity/dss.2018.00047
Mitigating I/O Impact of Checkpointing on Large Scale Parallel Systems
  • Jun 1, 2018
  • Nana Wang + 3 more

Checkpointing is the most widely used technique in high-performance computing systems to tolerate fail-stop errors and ensure reliable execution of parallel applications. However, as high-performance computers scale up, the number of processors and compute nodes increases rapidly, which magnifies the I/O impact of checkpointing on these systems. On arriving at a checkpoint, all the nodes generate checkpoint data and write them to the storage system simultaneously, causing bursts of massive traffic and data on the I/O infrastructure, including the interconnection network, parallel file system, and storage. To mitigate the I/O impact of checkpointing, this paper proposes a self-adaptive random delay approach to control the writing of checkpoint data. By generating checkpoint data simultaneously on each node but writing the data according to a self-adaptive random delay policy, the burst of traffic and data is smoothed. Experimental and theoretical analysis results show that this approach can mitigate the I/O impact of checkpointing on large-scale parallel systems.
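
The self-adaptive random delay policy can be sketched as each node drawing its write time uniformly from a window sized to the estimated drain time of the whole checkpoint, so writes from many nodes spread out instead of arriving as one burst. The window formula and parameter names below are a plausible reading of the approach, not the paper's exact policy.

```python
import random
import time

def write_checkpoint_with_random_delay(rank: int, nranks: int, data: bytes,
                                       path_prefix: str,
                                       agg_bw_mb_s: float = 1000.0) -> None:
    """Write one rank's checkpoint after a self-adaptively sized random delay.

    The delay window scales with the estimated time the PFS needs to absorb
    the full checkpoint (total size / aggregate bandwidth), so the burst is
    smoothed across that window.
    """
    total_mb = len(data) * nranks / 1e6       # rough size of the whole checkpoint
    window = total_mb / agg_bw_mb_s           # seconds needed to drain the burst
    time.sleep(random.uniform(0.0, window))   # each node picks its own delay
    with open(f"{path_prefix}.{rank}", "wb") as f:
        f.write(data)
        f.flush()
```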
