Abstract

Trends in technology scaling and reductions in supply voltage have significantly improved performance and reduced energy consumption in modern microprocessors. Microprocessors are being built with higher degrees of spatial parallelism and deeper pipelines to improve performance, which, however, makes them more susceptible to transient faults. Radiation causes "transient faults" or "single-event transients" in logic, which, once propagated and latched, become full cycle errors or soft errors. When radiation hits memory elements, the result is usually called a "single-event upset" or "soft error," as it can further propagate as a full cycle error. The problem of soft errors is further exacerbated in the large multiprocessors employed in servers, where reliability is a key concern. In the past, lockstep execution of the original and duplicate instructions has been used for error detection in multiprocessors. However, executing redundant threads in the on-chip multiprocessor (CMP) provides error detection at lower overhead, since the branch outcomes of the leading thread can be exploited during the execution of the trailing thread, and also because interprocessor communication latency is a key concern for lockstepping. In this paper, we show that by mining various redundancies inherent within a single core, interprocessor communication can be brought down to a minimum. Toward this, we propose techniques based on 1) temporal redundancy, 2) data value redundancy, and 3) information redundancy for error detection in multicore designs. We exploit temporal redundancy by using the "latency slack cycles" (LSC) of an instruction, which we define as the number of cycles before the computed result of the instruction becomes the source operand of a subsequent instruction. Value-based detection exploits the unused width of operands that hold small data values, and information redundancy is exploited by generating residue code check bits for the source operands. We show that with a clustered core multiprocessor, the interprocessor communication overhead can be significantly reduced. In our proposed multicore design, when a soft error is detected, error correction is achieved by rolling back execution to a previous checkpoint state and re-executing the instructions. The proposed techniques have been implemented in the RSIM simulation framework and validated using the SPLASH benchmarks. Experimental results indicate that the proposed soft error detection schemes can be implemented, on average, with less than a 10 percent increase in CPI on modern multicore designs.
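
To make the idea of residue code check bits concrete, the following is a minimal sketch of a residue-checked addition. It is not the implementation evaluated in the paper: the modulus of 3, the function names (`residue`, `checked_add`, `flip_bit`), and the injected-fault parameter are all assumptions chosen for illustration, and the sketch ignores fixed-width overflow. It relies only on the standard property that the residue of a sum equals the sum of the residues, modulo the check modulus.

```python
# Illustrative sketch only: modulus and all names here are hypothetical,
# not taken from the paper. Fixed-width overflow is not modeled.
MODULUS = 3  # assumed check modulus; the abstract does not specify one


def residue(value: int, modulus: int = MODULUS) -> int:
    """Check bits for an operand: its remainder modulo the check modulus."""
    return value % modulus


def checked_add(a: int, b: int, faulty_result=None, modulus: int = MODULUS):
    """Add a and b and verify the result against the operands' residues.

    faulty_result lets the caller substitute a corrupted sum to model a
    soft error in the adder. Returns (result, ok), where ok is False when
    the residue check flags a mismatch.
    """
    result = a + b if faulty_result is None else faulty_result
    # Residue of the sum predicted from the operand check bits alone.
    expected = (residue(a, modulus) + residue(b, modulus)) % modulus
    return result, residue(result, modulus) == expected


def flip_bit(value: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of a value."""
    return value ^ (1 << bit)


if __name__ == "__main__":
    print(checked_add(25, 17))
    corrupted = flip_bit(25 + 17, 4)
    print(checked_add(25, 17, faulty_result=corrupted))
```

A single bit flip changes the result by plus or minus a power of two, and no power of two is divisible by 3, so a modulo-3 residue check of this kind flags every single-bit error in the sum. Multi-bit errors whose net change is a multiple of 3 would escape it.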
