Abstract

As high-performance computing (HPC) systems have scaled up, resilience has become a major challenge. Various hardware and software techniques have been proposed to provide it. However, among popular software fault-tolerance techniques, both checkpoint-restart and replication face scalability challenges in the era of peta- and exa-scale systems because of the very large number of processes involved. In this situation, algorithm-based fault tolerance (ABFT) mechanisms have become attractive because they are efficient and lightweight. Although ABFT is algorithm-dependent, it can be implemented at a low level (e.g., in libraries for basic numerical algorithms), which makes it application-independent. However, previous ABFT approaches have mainly targeted fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not well suited to HPC systems; e.g., they detect errors using checksums of matrix rows and columns rather than checksums of blocks. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides fault-tolerant versions of PBLAS interfaces. To support error detection and recovery in the library, we propose a block-checksum approach for non-fatal errors and a separate scheme for handling node failure. We evaluate both fault-tolerant mechanisms and FT-PBLAS on HPC systems and report experimental results on the library's performance.

Highlights

  • With the scaling up of high-performance computing (HPC) systems in recent years, resilience has become a major challenge

  • This paper proposes FT-PBLAS, a fault-tolerant library for linear algebra computations that can be regarded as the fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS)

  • We propose a block-checksum approach, which uses a checksum of blocks instead of rows and columns to check for computational errors in HPC systems [14]


Summary

INTRODUCTION

Y. Zhu et al.: FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-Performance Computing Systems

With the scaling up of high-performance computing (HPC) systems in recent years, resilience has become a major challenge. Checkpoint-restart, a popular software fault-tolerance technique, scales poorly on such systems; e.g., at checkpoint time, all nodes need to be synchronized for some applications, and the volume of checkpoint data strains the I/O infrastructure. Another extensively used software-based technique is replication [9], [10], which replicates at different levels (e.g., the process level) to ensure the reliable execution of applications but consumes a large amount of resources. We propose a block-checksum-based approach for fault-tolerant matrix computations in HPC systems.
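To make the block-checksum idea concrete, the following is a minimal sketch of one generic ABFT-style check for a matrix product C = A·B. It assumes A is split into equal-height row blocks and relies on the identity (A_1 + ... + A_p)·B = C_1 + ... + C_p. The function names (`matmul`, `block_sum`, `split_row_blocks`, `verify_block_checksum`), the blocking scheme, and the tolerance are illustrative assumptions and are not taken from FT-PBLAS; the library's actual encoding may differ.

```python
# Illustrative block-checksum check for C = A @ B (pure Python, no dependencies).
# Assumption: row-block partitioning of A and C; this is a generic ABFT sketch,
# not the FT-PBLAS implementation.

def matmul(a, b):
    """Plain dense product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add(x, y):
    """Element-wise sum of two equally sized matrices (returns a new matrix)."""
    return [[p + q for p, q in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]

def split_row_blocks(m, block_rows):
    """Split a matrix into consecutive row blocks of height block_rows."""
    return [m[i:i + block_rows] for i in range(0, len(m), block_rows)]

def block_sum(blocks):
    """Sum a list of equally sized blocks into a single checksum block."""
    total = blocks[0]
    for blk in blocks[1:]:
        total = add(total, blk)
    return total

def verify_block_checksum(a, b, c, block_rows, tol=1e-9):
    """Check that sum(C blocks) matches (sum of A blocks) @ B within tol."""
    expected = matmul(block_sum(split_row_blocks(a, block_rows)), b)
    actual = block_sum(split_row_blocks(c, block_rows))
    return all(abs(e - v) <= tol
               for row_e, row_v in zip(expected, actual)
               for e, v in zip(row_e, row_v))

# Usage: a 4x2 matrix split into two 2x2 row blocks.
a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [2, 1]]
c = matmul(a, b)
print(verify_block_checksum(a, b, c, block_rows=2))  # clean result

c[2][1] += 1  # inject a single-element error into the second block of C
print(verify_block_checksum(a, b, c, block_rows=2))  # corrupted result
```

In this sketch the checksum is one extra block rather than one extra row and column per matrix, which is the distinction the text draws between block checksums and row/column checksums. The check only detects a mismatch; locating and correcting the faulty block would need additional information that is not shown here.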

FAULT MODEL
IMPLEMENTATION BASED ON PBLAS
CONCLUSION AND FUTURE WORK
