From Correctable Memory Errors to Uncorrectable Memory Errors: What Error Bits Tell

Cong Li, Yu Zhang, Shen Zhou, Hang Chen, Tai Huang, Lixin Wang, Liang Peng, Shijian Ge, Jialei Wang, Xian Liu

doi:10.1109/sc41404.2022.00081

Abstract

Uncorrectable memory errors are one of the major causes of failure in datacenters. In this paper, we present an empirical study correlating correctable errors (CEs) and uncorrectable errors (UEs) using large-scale field data covering 3 major dual in-line memory module (DIMM) manufacturers from a contemporary server farm at ByteDance. Unlike previous studies, ours is the first to take into account the error-bit information of CEs and the DIMM part numbers. Unlike the traditional chipkill error correction code (ECC), the ECC in contemporary Intel server platforms is weakened and cannot tolerate some error-bit patterns from a single chip. Using obtainable coarse-grained ECC knowledge, we derive a new indicator from the error-bit information: risky CE occurrence in terms of ECC guaranteed coverage. From the data, we show that the new indicator has consistently high sensitivity and specificity as a test for future UE occurrences across DIMMs from different manufacturers. This leads us to conjecture that the weakened ECC contributes substantially to many UEs today. We then apply the new risky CE indicator to predict future UE occurrence from the CE history. We empirically demonstrate how practically useful predictors can be constructed in conjunction with other useful attributes, such as certain micro-level fault indicators and DIMM part numbers, achieving state-of-the-art performance.
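
As an illustration of the evaluation terms used in the abstract, the sketch below computes the sensitivity and specificity of a binary per-DIMM indicator against later UE occurrence. It is a minimal sketch, not the authors' code: the function name `sensitivity_specificity` and the two boolean lists are hypothetical, and the toy values are invented purely to show the arithmetic, not taken from the paper's data.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(
    indicator: Iterable[bool], had_ue: Iterable[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of a binary indicator.

    indicator[i] is True if DIMM i was flagged (e.g. a risky CE was seen);
    had_ue[i] is True if DIMM i later experienced a UE.
    """
    tp = fn = tn = fp = 0
    for flagged, failed in zip(indicator, had_ue):
        if failed and flagged:
            tp += 1  # flagged DIMM that did fail
        elif failed:
            fn += 1  # failing DIMM that was missed
        elif flagged:
            fp += 1  # healthy DIMM that was flagged
        else:
            tn += 1  # healthy DIMM left unflagged
    # Sensitivity: fraction of UE DIMMs that were flagged beforehand.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of UE-free DIMMs that were not flagged.
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy, made-up labels for eight hypothetical DIMMs (illustration only).
risky_ce = [True, True, False, False, False, True, False, False]
future_ue = [True, True, True, False, False, False, False, False]

sens, spec = sensitivity_specificity(risky_ce, future_ue)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

An indicator is useful as a test for future UEs only when both numbers are high at once: high sensitivity alone can be obtained by flagging every DIMM, and high specificity alone by flagging none.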
