Commodity Servers Research Articles

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of dual in-line memory module (DIMM) days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology, and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, contrary to a common fear, we do not observe any indication that newer generations of DIMMs have worse error behavior.
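To give a feel for the scale of the quoted rate, the following is a minimal arithmetic sketch that converts errors per billion device hours per megabit into an expected number of errors per year. The 1 GiB DIMM capacity and the function name `errors_per_year` are assumptions chosen for illustration; they do not come from the abstract, and the result is a fleet-wide average, not a prediction for any single module.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def errors_per_year(rate_per_mbit: float, capacity_mbit: float) -> float:
    """Expected errors per year for a device of the given capacity.

    rate_per_mbit: errors per billion (1e9) device hours per Mbit,
                   the unit used in the abstract.
    capacity_mbit: device capacity in megabits (assumed value, see below).
    """
    return rate_per_mbit * capacity_mbit * HOURS_PER_YEAR / 1e9


# Assumption for illustration only: a 1 GiB DIMM, i.e. 8 * 1024 = 8192 Mbit.
capacity = 8 * 1024

low = errors_per_year(25_000, capacity)   # lower end of the quoted range
high = errors_per_year(70_000, capacity)  # upper end of the quoted range

print(f"{low:.0f} to {high:.0f} errors per year (average over all DIMMs)")
```

Under these assumptions the average works out to roughly 1,800 to 5,000 errors per DIMM per year. Since the abstract also reports that only a little over 8% of DIMMs see any errors in a year, an average this high suggests the errors are concentrated on a minority of modules rather than spread evenly.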

Articles published on Commodity Servers

A Portable Method to Improve the Performance of Commodity Servers for Massive Data Delivery

Zone-based data striping for cloud storage

Lightweight and Informative Traffic Metrics for Data Center Monitoring

Improved parallelism and scheduling in multi-core software routers

DRAM errors in the wild

Scalable and Cost-Effective Interconnection of Data-Center Servers Using Dual Server Ports

Succinct data structures for assembling large genomes

Using Paxos to build a scalable, consistent, and highly available datastore

The design of a practical system for fault-tolerant virtual machines

DataGarage

The little engine(s) that could

Cassandra

Towards a cost-effective networking testbed

The case for RAMClouds

Data-Intensive Text Processing with MapReduce

DRAM errors in the wild

Linear-scaling density-functional theory with tens of thousands of atoms: Expanding the scope and scale of calculations with ONETEP

Fault-tolerant stream processing using a distributed, replicated file system

Bigtable

10Gb/s Ethernet performance and retrospective
