Abstract
This paper investigated DRAM DIMM errors using field records in replacement network servers. A large sample of about 40 K DIMMs was collected over a 2.5-year period from 23 different server types, including various DIMMs from three different DRAM manufacturers with densities between 4 and 128 GB and speeds between 1066 and 2400 Mbps. Errors that occurred during system operation were classified as either correctable errors (CEs) or uncorrectable errors (UEs) based on the error correction code (ECC) schemes built into the servers. Of the collected DIMMs, 24% had recorded errors, where CE-only, UE-only, and UE and CE together comprised 28%, 43%, and 29% of recorded errors, respectively. Since UEs can cause large-scale failures, systems are replaced upon any UE occurrence. Approximately half of the UE-only DIMMs had a single UE. In contrast, many DIMMs had billions of CEs, where a faulty location may be accessed repeatedly. Such drastic differences between UE and CE counts help explain the importance of ECC and error mitigation schemes. Comparative analyses of errors were made across manufacturers and operating speeds. After reasonable adjustments for repetitive counts of errors, failure in time (FIT) differences were up to 38% across manufacturers. Higher-speed DIMMs generally had higher FIT, with 2400 Mbps DIMMs exhibiting 6.7 times the FIT of 1066 Mbps DIMMs.
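For readers unfamiliar with the FIT metric, the following is a minimal sketch of how a FIT value could be computed from field records. FIT is conventionally defined as the number of failures per 10^9 device-hours. The adjustment for repetitive error counts is not described in the abstract, so the sketch assumes one simple possibility: each DIMM contributes at most one failure, regardless of how many errors it logged. The function names, the record layout, and the example values are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple

FIT_SCALE = 1e9  # FIT is expressed per 10^9 device-hours


def count_failed_dimms(records: Iterable[Tuple[str, int]]) -> int:
    """Count distinct DIMMs with at least one logged error.

    Each record is a hypothetical (dimm_id, error_count) pair. Counting a
    DIMM once is an assumed stand-in for the paper's adjustment, which the
    abstract does not specify.
    """
    return len({dimm_id for dimm_id, error_count in records if error_count > 0})


def fit_rate(failures: int, devices: int, hours_per_device: float) -> float:
    """Return failures per 10^9 device-hours."""
    device_hours = devices * hours_per_device
    if device_hours <= 0:
        raise ValueError("device-hours must be positive")
    return failures * FIT_SCALE / device_hours


# Hypothetical records: one DIMM logs billions of CEs, another logs one error.
example_records = [("dimm-a", 2_000_000_000), ("dimm-b", 1), ("dimm-c", 0)]
example_failures = count_failed_dimms(example_records)
example_fit = fit_rate(example_failures, devices=1000, hours_per_device=1_000_000)
```

Under this assumption, a DIMM with billions of CEs at one faulty location weighs the same as a DIMM with a single error, which keeps repeated accesses to one fault from dominating the rate.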