A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

Abstract

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address this challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematic examination of the PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, the subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze the root causes of the observed abnormal symptoms in depth, and this analysis has led to a new patch set that will be merged into the upcoming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.
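
To make the fault-injection idea above concrete, the following is a minimal sketch (not PFault itself) of how a black-box injector might emulate two pre-defined fault models, whole-device failure and network partition, on individual storage nodes. The node names, device paths, and port number are illustrative assumptions; a real deployment would target the actual OST/MDT devices of the Lustre or BeeGFS cluster under test.

```python
#!/usr/bin/env python3
"""Minimal black-box fault-injection sketch (illustrative, not the PFault tool)."""
import subprocess

# Hypothetical storage nodes of a small test cluster and their backing devices.
STORAGE_NODES = {
    "oss1": "/dev/sdb",   # object storage server
    "oss2": "/dev/sdb",
    "mds1": "/dev/sdc",   # metadata server
}

def inject_whole_device_failure(node: str, device: str) -> None:
    """Emulate an unrecoverable device loss by zeroing part of the backing device."""
    cmd = f"dd if=/dev/zero of={device} bs=1M count=64 conv=fsync"
    subprocess.run(["ssh", node, cmd], check=True)

def inject_network_partition(node: str, port: int = 988) -> None:
    """Emulate a node becoming unreachable by dropping its PFS traffic.

    Port 988 is the default Lustre LNET TCP port; adjust for other PFSs.
    """
    cmd = f"iptables -A INPUT -p tcp --dport {port} -j DROP"
    subprocess.run(["ssh", node, cmd], check=True)

if __name__ == "__main__":
    # Apply one fault model to a single node, then run workloads and the PFS
    # checker (e.g., LFSCK or BeeGFS-FSCK) to observe recovery and logging.
    inject_whole_device_failure("oss1", STORAGE_NODES["oss1"])
```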

Similar Papers
  • Conference Article
  • Citations: 12
  • 10.1109/pdsw51947.2020.00013
Fingerprinting the Checker Policies of Parallel File Systems
  • Nov 1, 2020
  • Runzhou Han + 2 more

Parallel file systems (PFSes) play an essential role in high performance computing. To ensure integrity, many PFSes are designed with a checker component, which serves as the last line of defense to bring a corrupted PFS back to a healthy state. Motivated by real-world incidents of PFS corruption, we perform a fine-grained study of the capability of PFS checkers in this paper. We apply type-aware fault injection to specific PFS structures and examine the detection and repair policies of PFS checkers meticulously via a well-defined taxonomy. The study results on two representative PFS checkers show that they are able to handle a wide range of corruptions of important data structures. On the other hand, neither of them is perfect: there are multiple cases where the checkers may behave sub-optimally, leading to kernel panics, wrong repairs, etc. Our work has led to a new patch on Lustre. We hope to develop our methodology into a generic framework for analyzing the checkers of diverse PFSes and enable more elegant designs of PFS checkers for reliable high-performance computing.
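
As a rough illustration of type-aware fault injection, the sketch below corrupts exactly one named field of a serialized metadata record instead of flipping random disk blocks, which is what allows a study to attribute checker behavior to specific structures. The record layout, field offsets, and image path are hypothetical; the actual study targets Lustre and BeeGFS on-disk formats.

```python
# Type-aware corruption sketch: flip the bits of one field of one record.
# The layout below is hypothetical:
#   magic (4 B) | inode (8 B) | size (8 B) | stripe_count (4 B)
FIELD_OFFSETS = {
    "magic": (0, 4),
    "inode": (4, 8),
    "size": (12, 8),
    "stripe_count": (20, 4),
}

def corrupt_field(image_path: str, record_offset: int, field: str) -> None:
    """Overwrite exactly one field of one record with its bitwise complement."""
    off, length = FIELD_OFFSETS[field]
    with open(image_path, "r+b") as f:
        f.seek(record_offset + off)
        original = f.read(length)
        f.seek(record_offset + off)
        f.write(bytes(b ^ 0xFF for b in original))

if __name__ == "__main__":
    # Corrupt the stripe_count of the first record of a test image, then run
    # the PFS checker and classify its detection/repair result via the taxonomy.
    corrupt_field("metadata.img", record_offset=0, field="stripe_count")
```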

  • Conference Article
  • Citations: 33
  • 10.1145/1088149.1088192
High performance support of parallel virtual file system (PVFS2) over Quadrics
  • Jun 20, 2005
  • Weikuan Yu + 2 more

Parallel I/O needs to keep pace with the demand of high performance computing applications on systems with ever-increasing speed. Exploiting high-end interconnect technologies to reduce the network access cost and scale the aggregated bandwidth is one way to increase the performance of storage systems. In this paper, we explore the challenges of supporting a parallel file system with modern Quadrics features, including user-level communication and RDMA operations. We design and implement a Quadrics-capable version of a parallel file system (PVFS2). Our design overcomes the challenges that Quadrics' static communication model imposes on dynamic client/server architectures. Quadrics QDMA and RDMA mechanisms are integrated and optimized for high performance data communication. Zero-copy PVFS2 list IO is achieved with a Single Event Associated MUltiple RDMA (SEAMUR) mechanism. Experimental results indicate that the performance of PVFS2, with Quadrics user-level protocols and RDMA operations, is significantly improved in terms of both data transfer and management operations. With four IO server nodes, our implementation improves PVFS2 aggregated read bandwidth by up to 140% compared to PVFS2 over TCP on top of the Quadrics IP implementation. Moreover, it delivers significant performance improvement to application benchmarks such as mpi-tile-io [24] and BTIO [26]. To the best of our knowledge, this is the first work in the literature to report the design of a high performance parallel file system over Quadrics user-level communication protocols.

  • Conference Article
  • Citations: 10
  • 10.1109/empdp.2003.1183570
A parallel and fault tolerant file system based on NFS servers
  • Jan 1, 2003
  • F Garcia + 4 more

One important piece of system software for clusters is the parallel file system. Current parallel file systems and parallel I/O libraries for clusters do not use standard servers, which makes it very difficult to use these systems in heterogeneous environments. However, why use proprietary or special-purpose servers on the server end of a parallel file system when most of the necessary functionality is already available in NFS servers? This paper describes the fault tolerance implemented in Expand (Expandable Parallel File System), a parallel file system based on NFS servers. Expand allows the transparent use of multiple NFS servers as a single file system, providing a single name space. The different NFS servers are combined to create a distributed partition where files are striped. Expand requires no changes to the NFS server and uses RPC operations to provide parallel access to the same file. Expand is also independent of the clients, because all operations are implemented using RPC and the NFS protocol. Using this system, we can join heterogeneous servers (Linux, Solaris, Windows 2000, etc.) to provide a parallel and distributed partition. Fault tolerance is achieved using RAID techniques applied to parallel files. The paper describes the design of Expand and the evaluation of a prototype using the MPI-IO interface. This evaluation was performed on Linux clusters and compares Expand with PVFS.
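
The RAID-style fault tolerance over striped files can be pictured with a small sketch: each NFS mount point is stood in for by a directory, stripes of a file are written round-robin across the servers, and an XOR parity stripe (RAID-4 style) allows any one lost stripe to be rebuilt. The mount points, stripe size, and parity placement are assumptions for illustration, not Expand's actual layout.

```python
import os
from functools import reduce

# Hypothetical NFS mount points forming the distributed partition, plus a
# dedicated parity server (RAID-4 style; Expand's real layout may differ).
SERVERS = ["/mnt/nfs0", "/mnt/nfs1", "/mnt/nfs2"]
PARITY = "/mnt/nfs_parity"
STRIPE = 64 * 1024  # stripe unit in bytes

def xor_blocks(blocks):
    """XOR a list of equally sized byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def write_striped(name: str, data: bytes) -> None:
    """Stripe `data` round-robin across the servers and store XOR parity."""
    for base in range(0, len(data), STRIPE * len(SERVERS)):
        blocks = []
        for j, srv in enumerate(SERVERS):
            chunk = data[base + j * STRIPE : base + (j + 1) * STRIPE]
            block = chunk.ljust(STRIPE, b"\0")
            blocks.append(block)
            with open(os.path.join(srv, name), "ab") as f:
                f.write(block)
        with open(os.path.join(PARITY, name), "ab") as f:
            f.write(xor_blocks(blocks))

def recover_block(name: str, failed_idx: int, stripe_no: int) -> bytes:
    """Rebuild one stripe unit of a failed server from the survivors and parity."""
    survivors = [s for i, s in enumerate(SERVERS) if i != failed_idx] + [PARITY]
    blocks = []
    for srv in survivors:
        with open(os.path.join(srv, name), "rb") as f:
            f.seek(stripe_no * STRIPE)
            blocks.append(f.read(STRIPE))
    return xor_blocks(blocks)
```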

  • Research Article
  • Citations: 5
  • 10.5555/3019046.3019055
A generic framework for testing parallel file systems
  • Nov 13, 2016
  • Jinrui Cao + 4 more

Large-scale parallel file systems are of prime importance today. However, despite their importance, their failure-recovery capability is much less studied compared with local storage systems. Recent studies on local storage systems have exposed various vulnerabilities that could lead to data loss under failure events, which raises concerns for parallel file systems built on top of them. This paper proposes a generic framework for testing the failure handling of large-scale parallel file systems. The framework captures all disk I/O commands on all storage nodes of the target system to emulate realistic failure states, and checks whether the target system can recover to a consistent state without incurring data loss. We have built a prototype for the Lustre file system. Our preliminary results show that the framework is able to uncover the internal I/O behavior of Lustre under different workloads and failure conditions, which provides a solid foundation for further analyzing the failure recovery of parallel file systems.

  • Research Article
  • Citations: 14
  • 10.1088/1742-6596/180/1/012050
Building a parallel file system simulator
  • Jul 1, 2009
  • Journal of Physics: Conference Series
  • E Molina-Estolano + 3 more

Parallel file systems are gaining in popularity in high-end computing centers as well as commercial data centers. High-end computing systems are expected to scale exponentially and to pose new challenges to their storage scalability in terms of cost and power. To address these challenges, scientists and file system designers will need a thorough understanding of the design space of parallel file systems. Yet there exist few systematic studies of parallel file system behavior at petabyte and exabyte scale. An important reason is the significant cost of getting access to large-scale hardware to test parallel file systems. To contribute to this understanding, we are building a parallel file system simulator that can simulate parallel file systems at very large scale. Our goal is to simulate petabyte-scale parallel file systems on a small cluster or even a single machine in reasonable time and with reasonable fidelity. With this simulator, file system experts will be able to tune existing file systems for specific workloads, scientists and file system deployment engineers will be able to better communicate workload requirements, file system designers and researchers will be able to try out design alternatives and innovations at scale, and instructors will be able to study very large-scale parallel file system behavior in the classroom. In this paper, we describe our approach and provide preliminary results that are encouraging in terms of both fidelity and simulation scalability.

  • Conference Article
  • Citations: 1
  • 10.1109/hpcc/smartcity/dss.2019.00028
A Case Study on the Efficiency of User-Level Parallel File Systems
  • Aug 1, 2019
  • Chen Chen + 5 more

Improving I/O performance of large-scale computing systems has become increasingly dependent on the efficiency of parallel file systems. A common way to deploy parallel file systems is using the user space file system framework (e.g. FUSE), which introduces an extra I/O interposition layer that may cause considerable overhead due to excessive utilization of kernel crossings and system function calls. These slowdowns have been exacerbated due to the growing isolation of user and kernel space, increasing kernel crossing overhead significantly. In this paper, we present our findings on the evaluation of how FUSE affects the efficiency of a popular parallel file system, the Parallel Log-structured File System (PLFS). We then suggest a means to mitigate this issue by removing FUSE from the equation and demonstrate its viability with a proof of concept library.

  • Conference Article
  • Citations: 4
  • 10.1109/pdsw.2010.5668094
Virtualization-based bandwidth management for parallel storage systems
  • Nov 1, 2010
  • Yiqi Xu + 5 more

This paper presents a new parallel storage management approach that supports the allocation of shared storage bandwidth on a per-application basis. Existing parallel storage systems are unable to differentiate I/Os from different applications and meet per-application bandwidth requirements. This limitation presents a hurdle for applications to achieve their desired performance, which will become even more challenging as high-performance computing (HPC) systems continue to scale up with respect to both the amount of available resources and the number of concurrent applications. This paper proposes a novel solution to address this challenge through the virtualization of parallel file systems (PFSes). Such PFS virtualization is achieved with user-level PFS proxies, which interpose between native PFS clients and servers and schedule the I/Os from different applications according to a resource-sharing algorithm (e.g., SFQ(D)). In this way, virtual PFSes can be created on a per-application basis, each with a specific bandwidth share allocated according to its I/O requirement. This approach is applicable to different PFS-based parallel storage systems and can be transparently integrated with existing as well as future HPC systems. A prototype of this approach is implemented upon PVFS2, a widely used PFS, and evaluated with experiments using a typical parallel I/O benchmark (IOR). Results show that the approach's overhead is very small and that it achieves effective proportional sharing under different usage scenarios.
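
The SFQ(D) scheduling mentioned above can be sketched compactly: each request receives a start tag based on its flow's weight, requests are dispatched in start-tag order, and at most D requests are outstanding at the storage servers. The class below is an illustrative reading of start-time fair queuing with depth, not the paper's proxy code; names and structure are assumptions.

```python
import heapq
from collections import defaultdict

class SFQD:
    """Minimal Start-time Fair Queuing with depth D (SFQ(D)) sketch."""

    def __init__(self, weights: dict, depth: int = 4):
        self.weights = weights          # e.g. {"appA": 3, "appB": 1}
        self.depth = depth              # max outstanding requests (the D)
        self.vtime = 0.0                # global virtual time
        self.last_finish = defaultdict(float)
        self.queue = []                 # heap of (start_tag, seq, flow, cost)
        self.outstanding = 0
        self.seq = 0                    # tie-breaker for equal start tags

    def submit(self, flow: str, cost: float) -> None:
        """Tag an incoming request: start = max(vtime, flow's last finish)."""
        start = max(self.vtime, self.last_finish[flow])
        self.last_finish[flow] = start + cost / self.weights[flow]
        heapq.heappush(self.queue, (start, self.seq, flow, cost))
        self.seq += 1

    def dispatch(self):
        """Issue the request with the smallest start tag, or None at the depth limit."""
        if self.outstanding >= self.depth or not self.queue:
            return None
        start, _, flow, cost = heapq.heappop(self.queue)
        self.vtime = start              # advance virtual time to the dispatched tag
        self.outstanding += 1
        return flow, cost

    def complete(self) -> None:
        """Mark one outstanding request as finished by the servers."""
        self.outstanding -= 1
```

For example, with weights {"appA": 3, "appB": 1} and equal-cost requests from both applications, dispatches interleave roughly 3:1 in appA's favor, which is the proportional-sharing behavior the proxies aim for.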

  • Research Article
  • Citations: 8
  • 10.1016/j.proenv.2011.12.050
Performance Evaluation of A Infiniband-based Lustre Parallel File System
  • Jan 1, 2011
  • Procedia Environmental Sciences
  • Yuan Wang + 4 more

  • Research Article
  • Citations: 5
  • 10.1177/1094342016677084
Rethinking key–value store for parallel I/O optimization
  • Dec 23, 2016
  • The International Journal of High Performance Computing Applications
  • Anthony Kougkas + 4 more

Key–value stores are being widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architecture differences and performance characteristics of parallel file systems and key–value stores. We propose using key–value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as the cases with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of these two systems using collected data from our experiments, and we provide a predictive method to identify which system offers better I/O performance given a specific workload. The results show that we can optimize the I/O performance in HPC systems by utilizing key–value stores.

  • Conference Article
  • Citations: 1
  • 10.1109/cluster48925.2021.00099
A Scalability Study of Data Exchange in HPC Multi-component Workflows
  • Sep 1, 2021
  • Jie Yin + 3 more

Multi-component workflows play a significant role in High-Performance Computing and Big Data applications. They usually contain multiple, independently developed components that execute side-by-side to perform sophisticated computation and exchange data through file I/O over a parallel file system. However, file I/O can become an impediment in such systems and cause undesirable performance degradation due to its relatively low speed (compared to the interconnect fabric), which is unacceptable, especially for applications with strict time constraints. The Data Transfer Framework (DTF) is an I/O arbitration layer working with the PnetCDF I/O library that aims to eliminate this bottleneck by transparently redirecting file I/O operations destined for the parallel file system to message passing via the high-speed interconnect between coupled components. Scalable and high-speed data transfer between components can thus be achieved with minimal development effort by using DTF. However, previous work provides insufficient scalability evaluation of the framework. In order to comprehensively evaluate the scalability of an I/O middleware like DTF and highlight its major advantages, we develop an I/O benchmark for multi-component workflows. Using the benchmark, we conduct large-scale scalability evaluations using up to 32,768 compute nodes on the supercomputer Fugaku and 2,048 compute nodes on Oakforest-PACS, comparing direct data transfer to file I/O performed on the Lustre file system and Fugaku's Lightweight Layered IO-Accelerator (LLIO). We provide insights into DTF's scalability and performance enhancements with the intention to inform future I/O middleware and inter-component data exchange design in multi-component workflows.

  • Conference Article
  • Citations: 45
  • 10.1109/ipdps.2015.83
High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA
  • May 1, 2015
  • Md Wasi-Ur-Rahman + 4 more

The viability and benefits of running MapReduce over modern High Performance Computing (HPC) clusters, with high-performance interconnects and parallel file systems, have attracted much attention recently due to the unique opportunity of solving data analytics problems with a combination of Big Data and HPC technologies. Most HPC clusters follow the traditional Beowulf architecture with a separate parallel storage system (e.g., Lustre) and either no, or very limited, local storage. Since the MapReduce architecture relies heavily on the availability of local storage media, the Lustre-based global storage system in HPC clusters poses many new opportunities and challenges. In this paper, we propose a novel high-performance design for running YARN MapReduce on such HPC clusters by utilizing Lustre as the storage provider for intermediate data. We identify two different shuffle strategies, RDMA and Lustre Read, for this architecture and provide modules to dynamically detect the best strategy for a given scenario. Our results indicate that, due to the performance characteristics of the underlying Lustre setup, one shuffle strategy may outperform another in different HPC environments, and our dynamic detection mechanism can deliver the best performance based on the performance characteristics observed during job execution. Through this design, we achieve a 44% performance benefit for shuffle-intensive workloads on leadership-class HPC systems. To the best of our knowledge, this is the first attempt to exploit the performance characteristics of alternative shuffle strategies for YARN MapReduce with Lustre and RDMA.

  • Conference Article
  • Citations: 23
  • 10.1109/msst.2012.6232370
VPFS: Bandwidth virtualization of parallel storage systems
  • Apr 1, 2012
  • Yiqi Xu + 5 more

Existing parallel file systems are unable to differentiate I/O requests from concurrent applications and meet per-application bandwidth requirements. This limitation prevents applications from meeting their desired Quality of Service (QoS) as high-performance computing (HPC) systems continue to scale up. This paper presents vPFS, a new solution to address this challenge through a bandwidth virtualization layer for parallel file systems. vPFS employs user-level parallel file system proxies to interpose requests between native clients and servers and to schedule parallel I/Os from different applications based on configurable bandwidth management policies. vPFS is designed to be generic enough to support various scheduling algorithms and parallel file systems. Its utility and performance are studied with a prototype which virtualizes PVFS2, a widely used parallel file system. Enhanced proportional sharing schedulers are enabled based on the unique characteristics (parallel striped I/Os) and requirement (high throughput) of parallel storage systems. The enhancements include new threshold- and layout-driven scheduling synchronization schemes which reduce global communication overhead while delivering total-service fairness. An experimental evaluation using typical HPC benchmarks (IOR, NPB BTIO) shows that the throughput overhead of vPFS is small (<3% for writes, <1% for reads). It also shows that vPFS can achieve good proportional bandwidth sharing (>96% of the target sharing ratio) for competing applications with diverse I/O patterns.

  • Conference Article
  • Citations: 4
  • 10.1109/discs.2014.11
Rethinking Key-Value Store for Parallel I/O Optimization
  • Nov 1, 2014
  • Yanlong Yin + 7 more

Key-Value Stores (KVStores) are being widely used as the storage system for large-scale Internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems (PFSes) are the dominant storage systems. In this study, we carefully examine the architectural differences and performance characteristics of PFSes and KVStores. We propose that it is valuable to utilize KVStores to optimize the overall I/O performance, especially for workloads that PFSes cannot handle well, such as cases with intensive data synchronization or heavy metadata operations. To verify this proposal, we conducted comprehensive experiments with several synthetic benchmarks, an I/O benchmark, and a real application. The results show that our proposal is promising.

  • Conference Article
  • Citations: 3
  • 10.1109/pdcat46702.2019.00021
I/O Scheduling for Limited-Size Burst-Buffers Deployed High Performance Computing
  • Dec 1, 2019
  • Benbo Zha + 1 more

A burst buffer is a high-throughput, small-capacity intermediate storage system integrated between compute nodes and the permanent storage system to mitigate the I/O bottleneck problem in modern High Performance Computing (HPC) platforms. This system, however, is unable to effectively handle variable-intensity I/O bursts that result from unpredictable concurrent accesses to the shared Parallel File System (PFS). In this paper, we introduce a probabilistic I/O scheduling method that takes into account the burst-buffer load state and the instantaneous I/O load distribution of the system, based on a probabilistic model of the applications, to relieve I/O congestion when dynamic application interference pushes the I/O load beyond the PFS bandwidth. For HPC platforms with limited-size burst buffers, the proposed scheduling method makes online, probabilistic decisions about whether each concurrent I/O request should go through to the PFS, be buffered in the burst buffer, or be declined, according to both the available I/O bandwidth and the current buffer state, in order to maximize system efficiency or minimize application dilation. Extensive experiments on synthetic data with realistic characteristics show that our method handles I/O congestion effectively.
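
A bare-bones reading of the probabilistic selection could look like the function below: a request goes straight to the PFS if bandwidth allows, is buffered with a probability that grows with PFS overload and with remaining burst-buffer space, and is otherwise declined. The specific probability formula and parameter names are assumptions for illustration; the paper derives its decisions from a probabilistic model of application interference.

```python
import random

def schedule_request(size_mb: float,
                     pfs_free_bw_mb: float,
                     bb_free_mb: float,
                     bb_capacity_mb: float) -> str:
    """Probabilistically route one I/O burst: 'pfs', 'buffer', or 'decline'."""
    if size_mb <= pfs_free_bw_mb:
        return "pfs"                       # enough PFS bandwidth: let it go through
    # How badly the request exceeds the currently available PFS bandwidth.
    overload = 1.0 - pfs_free_bw_mb / size_mb
    # How much room is left in the burst buffer, as a fraction of capacity.
    buffer_room = bb_free_mb / bb_capacity_mb
    # Buffer with a probability that rises with overload and remaining space.
    if size_mb <= bb_free_mb and random.random() < overload * buffer_room:
        return "buffer"
    return "decline"                       # delay/decline to limit application dilation
```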

  • Conference Article
  • Citations: 1
  • 10.1109/hpcc/smartcity/dss.2018.00047
Mitigating I/O Impact of Checkpointing on Large Scale Parallel Systems
  • Jun 1, 2018
  • Nana Wang + 3 more

Checkpointing is the most widely used technique in high-performance computing systems to tolerate fail-stop errors and ensure reliable execution of parallel applications. However, as high-performance computers scale up, the number of processors and compute nodes increases rapidly, which magnifies the I/O impact of checkpointing on these systems. On arriving at a checkpoint, all the nodes generate checkpoint data and write them to the storage system simultaneously, causing bursts of massive traffic and data on the I/O infrastructure, including the interconnection network, parallel file system, and storage. To mitigate the I/O impact of checkpointing, this paper proposes a self-adaptive random delay approach to control the writing of checkpoint data. By generating checkpoint data simultaneously on each node but writing the data according to a self-adaptive random delay policy, the burst of traffic and data is smoothed. Experimental and theoretical analysis results show that this approach can mitigate the I/O impact of checkpointing on large-scale parallel systems.
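
The self-adaptive random delay policy can be sketched as each node drawing its write time uniformly from a window sized to the estimated drain time of the whole checkpoint, so writes from many nodes spread out instead of arriving as one burst. The window formula and parameter names below are a plausible reading of the approach, not the paper's exact policy.

```python
import random
import time

def write_checkpoint_with_random_delay(rank: int, nranks: int, data: bytes,
                                       path_prefix: str,
                                       agg_bw_mb_s: float = 1000.0) -> None:
    """Write one rank's checkpoint after a self-adaptively sized random delay.

    The delay window scales with the estimated time the PFS needs to absorb
    the full checkpoint (total size / aggregate bandwidth), so the burst is
    smoothed across that window.
    """
    total_mb = len(data) * nranks / 1e6       # rough size of the whole checkpoint
    window = total_mb / agg_bw_mb_s           # seconds needed to drain the burst
    time.sleep(random.uniform(0.0, window))   # each node picks its own delay
    with open(f"{path_prefix}.{rank}", "wb") as f:
        f.write(data)
        f.flush()
```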
