Detecting silent data corruption through data dynamic monitoring for scientific applications

Leonardo Bautista Gomez, Franck Cappello

doi:10.1145/2692916.2555279

Abstract

Parallel programming has become one of the best ways to express scientific models that simulate a wide range of natural phenomena. These complex parallel codes are deployed and executed on large-scale parallel computers, making them important tools for scientific discovery. As supercomputers get faster and larger, the increasing number of components is leading to higher failure rates. In particular, the miniaturization of electronic components is expected to lead to a dramatic rise in soft errors and data corruption. Moreover, soft errors can corrupt data silently and generate large inaccuracies or wrong results at the end of the computation. In this paper we propose a novel technique to detect silent data corruption based on data monitoring. Using this technique, an application can learn the normal dynamics of its datasets, allowing it to quickly spot anomalies. We evaluate our technique with synthetic benchmarks and we show that our technique can detect up to 50% of injected errors while incurring only negligible overhead.
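The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of monitoring data dynamics, not the authors' method. It assumes the "normal dynamics" of a dataset can be approximated by the largest step-to-step change seen during a training phase. The names `DeltaMonitor`, `slack`, and `flip_bit` are hypothetical and introduced purely for illustration.

```python
import struct


class DeltaMonitor:
    """Learns the largest per-element change between consecutive snapshots
    of a dataset, then flags elements that change by much more than that."""

    def __init__(self, slack=2.0):
        self.slack = slack      # assumed tolerance multiplier (illustrative)
        self.max_delta = 0.0    # largest |change| observed during training
        self.prev = None        # previous snapshot of the dataset

    def train(self, values):
        """Observe one trusted snapshot and update the learned bound."""
        if self.prev is not None:
            for old, new in zip(self.prev, values):
                self.max_delta = max(self.max_delta, abs(new - old))
        self.prev = list(values)

    def check(self, values):
        """Return indices whose change exceeds slack * max_delta,
        then store the snapshot as the new reference."""
        limit = self.slack * self.max_delta
        suspects = [i for i, (old, new) in enumerate(zip(self.prev, values))
                    if abs(new - old) > limit]
        self.prev = list(values)
        return suspects


def flip_bit(x, bit):
    """Emulate a soft error by flipping one bit (0..63) of a float64."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


def snapshot(t, n=8):
    """Synthetic, smoothly evolving dataset used only for this example."""
    return [0.1 * t + i for i in range(n)]


if __name__ == "__main__":
    monitor = DeltaMonitor()
    for t in range(5):
        monitor.train(snapshot(t))

    corrupted = snapshot(5)
    corrupted[3] = flip_bit(corrupted[3], 62)   # flip a high exponent bit
    print("suspect indices:", monitor.check(corrupted))
```

A flip in a high exponent bit changes the value far more than the learned bound and is flagged, whereas a flip in the lowest mantissa bit stays within the normal step-to-step change and goes unnoticed. This is one plausible reason a monitor of this kind would catch only a fraction of injected errors.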

Journal: ACM SIGPLAN Notices
Publication Date: Feb 6, 2014
Citations: 16

Similar Papers

Sirius: Neural Network Based Probabilistic Assertions for Detecting Silent Data Corruption in Parallel Programs
Tara E Thomas ... Anmol J Bhattad
01 Sep 2016

A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems
Khanh N Dang ... Yuichi Okuyama
The Journal of Supercomputing | VOL. 73
13 Jan 2017

Soft-Error Tolerant Quasi Delay-insensitive Circuits
01 Jan 2008
