Parallel File System Research Articles

The popularity of machine learning (ML) technologies and frameworks has led to an increasingly large number of ML workloads running on high-performance computing (HPC) clusters. ML workflows are readily being adopted in diverse computational fields such as biology, physics, materials, and computer science. The I/O behavior of emerging ML workloads differs distinctly from that of traditional HPC workloads, such as simulation or checkpoint/restart-based I/O. Additionally, ML workloads have pushed for the utilization of GPUs, or a combination of CPUs and GPUs, rather than CPUs alone for computational tasks. The diverse and complex I/O behavior of ML workloads requires extensive study and is critical for the efficient performance of the various layers of the I/O stack and the overall performance of HPC workloads. This work aims to fill the gap in understanding the I/O behavior of emerging ML workloads by providing an in-depth analysis of ML jobs running on large-scale leadership HPC systems. In particular, we analyze the behavior of jobs based on their scale, their science domains, and the processing units they use. The analysis was performed on 23,000 ML jobs collected from one year of Darshan logs on Summit, one of the fastest supercomputers. We also collect the CPU and GPU usage of 15,165 ML jobs by merging the Darshan dataset with the power usage of the processing units on Summit. This paper therefore provides a systematic I/O characterization of ML workloads on a leadership-scale HPC machine, showing how I/O behavior differs across science domains, workload scales, and processing units, and analyzing the usage of the parallel file system and burst buffer by ML I/O workloads. We make several observations regarding I/O performance and access patterns through various analytical studies and discuss the important lessons learned from the perspective of an ML user and a storage architect for emerging ML workloads running on large-scale supercomputers.

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied and understood than that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address this challenge, we study the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematic examination of PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre file system. Even after LFSCK's recovery attempt, subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze the root causes of the observed abnormal symptoms in depth, which has led to a new patch set to be merged into the coming Lustre release. In addition, we characterize the extensive logs generated in the experiments in detail and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.

Articles published on Parallel File System

Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems

H5bench: A unified benchmark suite for evaluating HDF5 I/O performance on pre‐exascale platforms

Tarazu: An Adaptive End-to-end I/O Load-balancing Framework for Large-scale Parallel File Systems

A step towards the final frontier: Lessons learned from acceptance testing of the first HPE/Cray EX 3000 system at ORNL

Dynamic Multimedia Encryption Using a Parallel File System Based on Multi-Core Processors

Design and Implementation of Burst Buffer Over-Subscription Scheme for HPC Storage Systems

SwMPAS-A: Scaling MPAS-A to 39 Million Heterogeneous Cores on the New Generation Sunway Supercomputer

Artificial neural networks based predictions towards the auto-tuning and optimization of parallel IO bandwidth in HPC system

Storage-Heterogeneity Aware Task-based Programming Models to Optimize I/O Intensive Applications

Optimizing Error-Bounded Lossy Compression for Scientific Data With Diverse Constraints

Seismic data IO and sorting optimization in HPC through ANNs prediction based auto-tuning for ExSeisDat

I/O performance analysis of machine learning workloads on leadership scale supercomputer

A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

AI4IO: A suite of AI-based tools for IO-aware scheduling

User‐level parallel file system: Case studies and performance optimizations

Enabling machine learning-ready HPC ensembles with Merlin

Error-Controlled Data Reduction Approach for Large-Scale Structured Datasets

Cloud Computing in Remote Sensing: High Performance Remote Sensing Data Processing in a Big Data Environment

Applying neural networks to predict HPC-I/O bandwidth over seismic data on lustre file system for ExSeisDat

Holistic I/O Activity Characterization Through Log Data Analysis of Parallel File Systems and Interconnects
