Application-based fault tolerance techniques for sparse matrix solvers

Simon McIntosh-Smith, Alex Warwick Vesztrocy, Rob Hunt, James Price

doi:10.1177/1094342017694946

Abstract

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increased electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may result in future exascale systems being more susceptible to soft errors caused by cosmic radiation than current high-performance computing systems. Through the use of techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated, at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enable us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency, and fault tolerance of the overall solution.

Journal: The International Journal of High Performance Computing Applications
Publication Date: May 10, 2017
Citations: 2
License type: other-oa

Similar Papers

Exploiting Spatial Information in Datasets to Enable Fault Tolerant Sparse Matrix Solvers
Rob Hunt ... Simon McIntosh-Smith
01 Sep 2015

FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems
Yanchao Zhu ... Guozhen Zhang
IEEE Access | VOL. 8
01 Jan 2020

Fault tolerance of MPI applications in exascale systems: The ULFM solution
Nuria Losada ... Keita Teranishi
Future Generation Computer Systems | VOL. 106
20 Jan 2020

A Validation Approach for Quasi-Synchronous Checkpointing Algorithms in HPC Systems
Houda Khlif ... Ahmed Hadj Kacem
01 Oct 2017
