Abstract

The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as the memory subsystem and I/O controllers, of a system-on-a-chip (SoC). In this paper, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction, and achieves $20~000{\times }$ speedup over register-transfer-level-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multicore SoC using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for traditional system-level checkpoint recovery techniques. To overcome such recovery challenges, we present a new replay recovery technique for uncore components belonging to the memory subsystem. For the L2 cache controller and the dynamic random-access memory controller components of OpenSPARC T2, our new technique reduces the probability that an application run fails to produce correct results due to soft errors by more than $50 {\times }$ with 1.82% and 2.58% chip-level area and power impact, respectively.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call