ML-driven risk estimation for memory failure in a data center environment with convolutional neural networks, self-supervised data labeling and distribution-based model drift determination

Tim Breitenbach, Shrikanth Malavalli Divakar, Lauritz Rasbach, Patrick Jahnke

doi:10.1016/j.jpdc.2023.104800

Abstract

With the trend towards multi-socket server systems, the demand for random access memory (RAM) per server has increased. As a consequence, servers have more DIMM sockets. Since every dual in-line memory module (DIMM), which comprises a series of dynamic random-access memory integrated circuits, has a probability of failure, RAM issues have become a dominant failure pattern for servers. The concept introduced in this work contributes to improving the reliability of data centers by avoiding RAM failures and mitigating their impact. For this purpose, an ML-driven framework is provided to estimate the probability of memory failure for each RAM module. The ML framework is based on structural information between correctable errors (CEs) and uncorrectable errors (UEs). In a common memory scenario, a corrupted bit within a module can be restored by redundancy using an error correction code (ECC), resulting in a CE. However, if there is more than one corrupted bit within a group of bits covered by the ECC, the information cannot be restored, resulting in a UE. Consequently, the task requesting the memory content and the corresponding service may crash. There is evidence that UEs have a CE history and that the CEs are structurally related. However, for the case of a UE without a CE history or of a false decision by the ML framework, we extend the overall framework with engineering measures that mitigate the impact of a UE by avoiding kernel panic and using backups. The engineering measures use a mapping between physical and logical memory addresses.
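
The CE/UE distinction described in the abstract can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below is an assumption for illustration only: the abstract does not state which ECC the memory modules use, and real DIMMs protect much wider words. The function names `encode` and `decode` are hypothetical. A 4-bit value is stored in an 8-bit word (Hamming(7,4) plus one overall parity bit); one flipped bit is repaired and reported as a CE, while two flipped bits are detected but cannot be repaired and are reported as a UE.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED word (list of 0/1 values).

    Index 0 holds the overall parity bit; indices 1..7 are the
    Hamming(7,4) positions (parity bits at 1, 2, 4; data at 3, 5, 6, 7).
    """
    word = [0] * 8
    word[3] = (nibble >> 0) & 1
    word[5] = (nibble >> 1) & 1
    word[6] = (nibble >> 2) & 1
    word[7] = (nibble >> 3) & 1
    word[1] = word[3] ^ word[5] ^ word[7]  # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]  # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]  # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2            # makes the total parity even
    return word


def decode(word):
    """Return (status, nibble); status is 'OK', 'CE', or 'UE'."""
    word = list(word)
    # XOR of the positions of all set bits is 0 for a valid codeword
    # and equals the flipped position after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    overall_parity = sum(word) % 2

    if syndrome == 0 and overall_parity == 0:
        status = "OK"
    elif overall_parity == 1:
        # Odd total parity: assume exactly one flipped bit and repair it
        # (syndrome 0 means the overall parity bit itself flipped).
        word[syndrome] ^= 1
        status = "CE"
    else:
        # Even total parity but nonzero syndrome: two bits flipped,
        # which this code can detect but not repair.
        return "UE", None

    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


stored = encode(0b1011)

one_flip = list(stored)
one_flip[5] ^= 1          # single corrupted bit

two_flips = list(stored)
two_flips[2] ^= 1         # two corrupted bits in the same ECC word
two_flips[6] ^= 1

clean_result = decode(stored)
ce_result = decode(one_flip)
ue_result = decode(two_flips)
```

In this toy setting, three or more flipped bits can be misclassified, which is a known limit of SECDED codes and not a claim about the hardware studied in the paper.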

Full Text