Abstract

Soft errors are transient errors that occur due in part to reductions in capacitance and operating voltages in modern electronic components. Recently, soft errors have become more prevalent due to several design factors, including aggressive device scaling and newer energy-efficient designs, and they now significantly threaten the reliability of computer systems. Because soft errors occur non-deterministically, detecting them and recovering from them can be quite challenging. A common way to detect soft errors is to execute two identical program instances and then compare their results. Although this approach is effective, it is inefficient, because substantial computation and memory resources must be invested to support the redundant executions. The introduction of hardware transactional memory (HTM) in modern chip multiprocessors (CMPs) provides an opportunity to leverage its redundant information to address emerging reliability concerns, including soft errors. In this work, we propose and implement a reliability-enhanced HTM system, called RE-HTM, which leverages this redundancy to detect soft errors occurring in the L1 data cache and then recover from them. We then empirically evaluate RE-HTM. The results indicate that, in protecting the L1 data cache from soft errors, RE-HTM is more effective than the existing approach of running two redundant execution instances while incurring lower runtime overhead.
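
For illustration only, the baseline approach mentioned above (running the same computation twice and comparing the results) can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the RE-HTM mechanism: it does not model HTM or the L1 data cache, and the names `run_redundantly`, `SoftErrorDetected`, and `max_attempts` are hypothetical, introduced here only for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SoftErrorDetected(Exception):
    """Raised when the redundant runs keep disagreeing (hypothetical name)."""


def run_redundantly(task: Callable[[], T], max_attempts: int = 3) -> T:
    """Run `task` twice per attempt and accept the result only if both runs agree.

    A mismatch is treated as a suspected transient fault, and the pair of
    runs is simply repeated (re-execution as a naive form of recovery).
    """
    for _ in range(max_attempts):
        first = task()   # first execution instance
        second = task()  # redundant execution instance
        if first == second:
            return first
    raise SoftErrorDetected(f"results disagreed in {max_attempts} attempts")


def checksum(data: list[int]) -> int:
    """Toy workload standing in for the protected computation."""
    return sum(data) & 0xFFFFFFFF


if __name__ == "__main__":
    print(run_redundantly(lambda: checksum([1, 2, 3])))
```

Every attempt pays for two full executions, which is the computation and memory cost the abstract refers to. Note also that comparing two runs can only detect a disagreement; it cannot tell which run was faulty, so recovery here falls back to re-execution.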
