Summary
The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. Blue Waters includes a mix of dual‐socket CPU (XE) nodes and single‐socket CPU, single‐GPU (XK) nodes. The primary storage is provided by Cray's Sonexion/ClusterStor Lustre storage system, which delivers 35 PB of raw storage at 1 TB/s. The statistical failure rates over time for each component, including CPU, DIMM, GPU, disk drive, power supply, blower, etc., and their impact on higher‐level failure rates for individual nodes and the system as a whole are presented in detail, with particular emphasis on identifying any increase in rate that might indicate that the right side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray for minimizing the impact of component failure, such as the preemptive removal of suspect disk drives, are also presented.