FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

Yanchao Zhu, Yi Liu, Guozhen Zhang

doi:10.1109/access.2020.2975832

Abstract

As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face challenges of scalability in the era of peta- and exa-scale systems due to their numerous processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a scheme for addressing node failure, respectively. We evaluate two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.

Highlights

With the scaling up of high performance computing (HPC) systems in recent years, resilience has become a major challenge
This paper proposes FT-PBLAS, a fault-tolerant library for linear algebra computations, which can be regarded as the fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS)
We propose a block-checksum approach, which uses a checksum of blocks instead of rows and columns to check for computational errors in HPC systems [14]

Summary

INTRODUCTION

With the scaling up of high-performance computing (HPC) systems in recent years, resilience has become a major challenge. The checkpoint-restart approach faces scalability problems: e.g., at checkpoint time, all the nodes need to be synchronized for some particular applications, and the volume of the checkpoint data impacts the I/O infrastructure. Another extensively used software-based technique is replication [9], [10], which replicates at different levels (e.g., the process level) to ensure the reliable execution of applications but consumes a large amount of resources. We propose a block-checksum-based approach for fault-tolerant matrix computations in HPC systems.
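To make the block-checksum idea concrete, the following is a minimal sketch in plain Python, not the FT-PBLAS implementation. It assumes that a block checksum is the element-wise sum of the row blocks of a matrix, so that multiplying a checksum-extended A by B yields a product whose last block should equal the sum of its own row blocks. The function names, the block size `bs`, and the tolerance `tol` are hypothetical choices for illustration; the encoding used in the paper may differ.

```python
def matmul(a, b):
    """Dense matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def block_checksum_rows(m, bs):
    """Element-wise sum of the row blocks of m (each block has bs rows)."""
    assert len(m) % bs == 0, "row count must be a multiple of the block size"
    chk = [[0] * len(m[0]) for _ in range(bs)]
    for i, row in enumerate(m):
        for j, v in enumerate(row):
            chk[i % bs][j] += v  # row i contributes to row (i mod bs) of the checksum block
    return chk


def append_block_checksum(m, bs):
    """Return a copy of m with its block checksum appended as a final row block."""
    return [row[:] for row in m] + block_checksum_rows(m, bs)


def find_mismatches(c_ext, bs, tol=1e-9):
    """Compare the stored checksum block with one recomputed from the data blocks.

    Returns (row-within-block, column) positions where the two disagree.
    """
    data, stored = c_ext[:-bs], c_ext[-bs:]
    fresh = block_checksum_rows(data, bs)
    return [(r, j)
            for r in range(bs)
            for j in range(len(stored[0]))
            if abs(fresh[r][j] - stored[r][j]) > tol]


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8]]   # two 2x2 row blocks
    b = [[1, 0], [2, 1]]
    c_ext = matmul(append_block_checksum(a, 2), b)
    print(find_mismatches(c_ext, 2))        # no error injected
    c_ext[1][0] += 5                        # simulate a corrupted element
    print(find_mismatches(c_ext, 2))        # position of the inconsistency
```

The check works because (A1 + A2)B = A1B + A2B, so the checksum block is carried through the multiplication. In this sketch a mismatch identifies the position inside a block but not which block was corrupted; locating and correcting the fault would need additional redundancy beyond what is shown here.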

FAULT MODEL

IMPLEMENTATION BASED ON PBLAS

CONCLUSION AND FUTURE WORK