A study of DRAM failures in the field

Vilas Sridharan, Dean Liberty

doi:10.5555/2388996.2389100

Abstract

Most modern computer systems use dynamic random access memory (DRAM) as a main memory store. Recent publications have confirmed that DRAM errors are a common source of failures in the field. Therefore, further attention to the faults experienced by DRAM sub-systems is warranted. In this paper, we present a study of 11 months of DRAM errors in a large high-performance computing cluster. Our goal is to understand the failure modes, rates, and fault types experienced by DRAM in production settings. We identify several unique DRAM failure modes, including single-bit, multi-bit, and multi-chip failures. We also provide a deterministic bound on the rate of transient faults in the DRAM array, by exploiting the presence of a hardware scrubber on our nodes. We draw several conclusions from our study. First, DRAM failures are dominated by permanent, rather than transient, faults, although not to the extent found by previous publications. Second, DRAMs are susceptible to large multi-bit failures, such as failures that affect an entire DRAM row or column, indicating faults in shared internal circuitry. Third, we identify a DRAM failure mode that disrupts access to other DRAM devices that share the same board-level circuitry. Finally, we find that chipkill error-correcting codes (ECC) are extremely effective, reducing the node failure rate from uncorrected DRAM errors by 42x compared to single-error correct/double-error detect (SEC-DED) ECC.

Similar Papers

A study of DRAM failures in the field
Vilas Sridharan ... Dean Liberty
01 Nov 2012

A Locality-Aware Compression Scheme for Highly Reliable Embedded Systems
Juhyung Hong ... Sangwoo Han
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | VOL. 38
01 Mar 2019

Characterization of data retention faults in DRAM devices
Angelo Bacchini ... Gianluca Furano
01 Oct 2014

DRAM circuit design: a tutorial
...
Choice Reviews Online | VOL. 38
01 May 2001
