Abstract

Early in the deployment of the ASC Q cluster supercomputer system, an unexpectedly high rate of soft errors were observed in the board-level cache subsystems of the constituent AlphaServer ES45 systems that make up the compute component of this large cluster. A series of tests and experiments was undertaken to validate the hypothesis that this frequency was consistent with the high level of terrestrial secondary cosmic-ray neutron flux resulting from the high elevation of its installation site. The overall success of this effort is reported elsewhere in this issue. This paper reports on three secondary phenomena that were observed during these tests and experiments: Error logs were collected from all servers during a representative period and examined for nonrandom event rates, which would indicate a systematic cause. The only significant result of this exploration was the discovery of a latent soft-error discovery effect, and a self-shielding effect, whereby the servers positioned physically higher in their racks suffered disproportionately higher soft-error rates. This excess was examined and found to be consistent with established shielding effect of the high-Z composition of the constituents of the overlying systems. Experiments with individual ES45 systems in an artificial neutron beam at the Los Alamos Neutron Science Center facility have established that the soft-error rates observed in the SRAM parts is significantly dependent on the incident direction of the neutrons in the beam. These asymmetries could be exploited as part of a strategy for mitigating the frequency of soft errors in future computer systems.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call