Abstract
In their quest for exascale and beyond, High-Performance Computing (HPC) systems continue to grow larger and more complex. Application developers, meanwhile, are adopting novel methods to improve the efficiency of their codes: a recent trend is the use of floating-point mixed precision, i.e., the careful interleaving of single- and double-precision arithmetic, as a tool to improve performance and reduce network and memory boundedness. However, while modern HPC systems are known to suffer hardware faults at daily rates, the impact of reduced precision on application reliability is yet to be explored. In this work we aim to fill this gap: first, we conduct a qualitative survey to identify the branches of HPC where mixed precision is most popular. Second, we present the results of instruction-level fault injection experiments on a variety of representative HPC workloads, comparing their vulnerability to Silent Data Errors (SDEs) under different numerical configurations. Our experiments indicate that the use of single and mixed precision leads to comparatively more frequent and more severe SDEs, with concerning implications for their use on extreme-scale, fault-prone HPC platforms.
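To make the abstract's core comparison concrete, the sketch below injects a single bit flip into the same bit position of a single- and a double-precision value and reports the resulting relative error. It is a minimal hand-rolled illustration assuming IEEE 754 binary32/binary64 layouts, not the paper's actual injection framework; the helper names (flip_f32, flip_f64) are hypothetical.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flip one bit of a single-precision value (bit in [0, 31]). */
static float flip_f32(float v, int bit) {
    uint32_t u;
    memcpy(&u, &v, sizeof u);   /* type-pun via memcpy (well-defined) */
    u ^= (uint32_t)1 << bit;    /* inject the bit flip */
    memcpy(&v, &u, sizeof v);
    return v;
}

/* Flip one bit of a double-precision value (bit in [0, 63]). */
static double flip_f64(double v, int bit) {
    uint64_t u;
    memcpy(&u, &v, sizeof u);
    u ^= (uint64_t)1 << bit;
    memcpy(&v, &u, sizeof v);
    return v;
}

int main(void) {
    const double x = 3.141592653589793;
    /* Flip the same mantissa bit position (12) in both formats:
     * binary32 has only 23 mantissa bits, so this bit is far more
     * significant there than among binary64's 52 mantissa bits. */
    float  xf = flip_f32((float)x, 12);
    double xd = flip_f64(x, 12);
    printf("f32 relative error after flip: %.3e\n", fabs((double)xf - x) / x);
    printf("f64 relative error after flip: %.3e\n", fabs(xd - x) / x);
    return 0;
}

On a typical IEEE 754 platform the single-precision error comes out orders of magnitude larger (on the order of 10^-4 versus 10^-13), which gives one intuition for the abstract's finding: the same corruption perturbs reduced-precision state far more severely.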