Characterizing and Mitigating Soft Errors in GPU DRAM

Michael B Sullivan, Donghyuk Lee, Saurabh Hukerikar, Nirmal R Saxena, Paul Racunas, Mike O'Connor, Timothy Tsai, Siva Kumar Sastry Hari, Stephen W Keckler

doi:10.1109/mm.2022.3163122

Abstract

While graphics processing units (GPUs) are used in high-reliability systems, wide GPU dynamic random-access memory (DRAM) interfaces make error protection difficult, as wide-device correction through error checking and correcting (ECC) is expensive and impractical. This challenge is compounded by worsening relative rates of multibit DRAM errors and increasing GPU memory capacities. This work uses high-energy neutron beam tests to inform the design and evaluation of GPU DRAM error-protection mechanisms. Based on observed locality in multibit error patterns, we propose several novel ECC schemes to decrease the silent data corruption (SDC) risk by up to five orders of magnitude relative to single-bit-error-correcting and double-bit-error-detecting (SEC-DED) ECC, while also reducing the number of uncorrectable errors by up to 7.87×. We compare novel binary and symbol-based ECC organizations that differ in their design complexity and hardware overheads, ultimately recommending two promising organizations. These schemes replace SEC-DED ECC with no additional redundancy, likely no performance degradation, and modest area and complexity costs.
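To make the SEC-DED baseline concrete, the sketch below implements a toy extended Hamming (8,4) code in Python. It is an illustration only: it is not one of the schemes proposed in the paper, it does not model real GPU DRAM codeword sizes, and the function names `encode` and `decode` are hypothetical. It shows why multibit errors matter: a single flipped bit is corrected, two flipped bits are detected, but three flipped bits are treated as a single-bit error and "corrected" to the wrong data, which is a silent data corruption.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    codeword with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    # XOR together the positions of all set data bits.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    # Set the check bits so that the syndrome of the full codeword is zero.
    for p in (1, 2, 4):
        if syndrome & p:
            code[p] = 1
    # Overall parity bit makes the total number of ones even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(word):
    """Return (data_bits_or_None, status) for an 8-bit received word."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "clean"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        # A syndrome of 0 points at the overall parity bit (index 0).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: double-bit error, detect only.
        return None, "detected-uncorrectable"
    return [word[3], word[5], word[6], word[7]], status


def flip(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for pos in positions:
        out[pos] ^= 1
    return out


data = [1, 0, 1, 1]
codeword = encode(data)
single = decode(flip(codeword, [5]))        # corrected back to the original data
double = decode(flip(codeword, [3, 5]))     # flagged as uncorrectable
triple = decode(flip(codeword, [3, 5, 6]))  # reported "corrected" but data is wrong
```

In this toy code every three-bit error lands one bit away from a different valid codeword, so the decoder always miscorrects it. Production SEC-DED codes use longer codewords, where some three-bit errors are detected instead, but the same failure mode exists.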

Journal: IEEE Micro
Publication Date: Jul 1, 2022
Citations: 2

