Abstract

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advance discoveries and that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC) in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs in DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into training phases of DL models. We use checkpoint file alteration to study the effect of having bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework—so long as they produce a Hierarchical Data Format 5 checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call