Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo

Scott Levy, Vilas Sridharan, Nathan Debardeleben, Kurt B Ferreira, Taniya Siddiqua, Elisabeth Baseman

doi:10.1109/sc.2018.00046

Abstract

Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent work demonstrates that hardware failures are expected to become more common. Few existing studies, however, have examined failures in the context of the entire lifetime of a single platform. In this paper, we analyze a corpus of empirical failure data collected over the entire five-year lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several important findings about failures on Cielo: (i) its memory (DRAM and SRAM) exhibited no aging effects; detectable, uncorrectable errors (DUE) showed no discernible increase over its five-year lifetime; (ii) contrary to popular belief, correctable DRAM faults are not predictive of future uncorrectable DRAM faults; (iii) the majority of system down events have no identifiable hardware root cause, highlighting the need for more comprehensive logging facilities to improve failure analysis on future systems; and (iv) continued advances will be needed in order for current failure mitigation techniques to be viable on future systems. Our analysis of this corpus of empirical data provides critical analysis of, and guidance for, the deployment of extreme-scale systems.

Similar Papers

Data flow analysis for anomaly detection and identification toward resiliency in extreme scale systems
Byoung Uk Kim
The Journal of Supercomputing | VOL. 61
22 Jul 2011

Improving HPC Application Performance in Public Cloud
Rashid Hassani ... Peter Luksch
IERI Procedia | VOL. 10
01 Jan 2014

HPC Process and Optimal Network Device Affinitization
Ravindra Babu Ganapathi ... Aravind Gopalakrishnan
IEEE Transactions on Multi-Scale Computing Systems | VOL. 4
01 Oct 2018

Memory Errors in Modern Systems
Vilas Sridharan ... Sudhanva Gurumurthi
ACM SIGPLAN Notices | VOL. 50
14 Mar 2015
