Abstract
Uncorrectable memory errors are one of the major causes of failure in datacenters. In this paper, we present an empirical study correlating correctable errors (CEs) and uncorrectable errors (UEs), using large-scale field data covering 3 major dual in-line memory module (DIMM) manufacturers from a contemporary ByteDance server farm. Unlike previous studies, ours is the first to incorporate the error-bit information of CEs and the DIMM part numbers. In contrast to traditional chipkill error correction code (ECC), the ECC on contemporary Intel server platforms is weakened and cannot tolerate some error-bit patterns from a single chip. Using obtainable coarse-grained ECC knowledge, we derive a new indicator from the error-bit information: the occurrence of risky CEs, defined in terms of the ECC's guaranteed coverage. From the data, we show that the new indicator has consistently high sensitivity and specificity as a test for future UE occurrence across DIMMs from different manufacturers. This leads us to conjecture that the weakened ECC contributes substantially to many UEs today. We then apply the new risky CE indicator to predicting future UE occurrence from the CE history. We empirically demonstrate how practically useful predictors can be constructed by combining it with other useful attributes, such as certain micro-level fault indicators and DIMM part numbers, achieving state-of-the-art performance.
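To make the evaluation criterion concrete, the following minimal sketch shows how the sensitivity and specificity of a binary per-DIMM indicator could be computed against later UE occurrence. The function name, variable names, and the six toy records are hypothetical illustrations and are not taken from the study's data or code.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    flagged: Sequence[bool], had_ue: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of a binary indicator.

    flagged[i] -- True if DIMM i showed the indicator (e.g. a risky CE).
    had_ue[i]  -- True if DIMM i later experienced a UE.
    """
    if len(flagged) != len(had_ue):
        raise ValueError("inputs must have the same length")

    tp = fn = tn = fp = 0
    for f, u in zip(flagged, had_ue):
        if u and f:
            tp += 1  # UE occurred and the indicator fired
        elif u and not f:
            fn += 1  # UE occurred but the indicator was silent
        elif not u and not f:
            tn += 1  # no UE and no indicator
        else:
            fp += 1  # indicator fired but no UE followed

    # Sensitivity: fraction of UE DIMMs that were flagged beforehand.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of non-UE DIMMs that were not flagged.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy records, for illustration only.
toy_flagged = [True, True, False, False, True, False]
toy_had_ue = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(toy_flagged, toy_had_ue)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

A high value on both measures means the indicator rarely misses DIMMs that go on to fail and rarely flags DIMMs that stay healthy, which is the property the abstract claims for the risky CE indicator.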