High dependability has become a paramount requirement for computing systems, as they are increasingly used in business & life critical applications. Advances in the design & manufacturing of semiconductor devices have increased the performance of computing systems at a dazzling pace. However, smaller transistor dimensions, lower supply voltages, and higher operating frequencies have negatively impacted dependability by increasing the probability of occurrence of transient & intermittent faults. This paper discusses the main trends in the dependability of semiconductor devices, and presents a candidate architecture for a fault-tolerant microprocessor. The dependability of the processor is analyzed, and the advantages provided by fault tolerance are underscored. The effect of higher rates of occurrence of transient & intermittent faults on a typical microprocessor is evaluated with the aid of generalized stochastic Petri net (GSPN) modeling. Dependability analysis shows that a five-fold increase in the rate of occurrence of transients leads to a mean time between failures (MTBF) about five times lower, if no error recovery mechanisms are employed. Significantly lower processor availability is also observed. The fault-tolerant processor is devised to mitigate the impact of the higher transient & intermittent fault rates. The processor is based on core redundancy & state checkpointing, and supports three levels of error recovery. First, recovery from a saved state (SSRC) is attempted. The second level consists of a retry (SSRR), and is activated when the first level of recovery fails. Processor reset, followed by reintegration under operating system control (RB), is the third level of recovery. Dependability analysis, based on GSPN, shows that the fault-tolerance features of the processor preserve the MTBF, even if the rate of transient faults nearly doubles. In terms of availability, a four-fold increase in the rate of occurrence of transients is compensated. The effect of intermittent faults is also analyzed.
A five-fold increase in the failure rate of intermittent faults may lower MTBF by 31% to 33%. MTBF decreases even more, by 45% to 67%, if bursts of errors are considered. Intermittent faults have a negative impact on availability as well. Maintaining the dependability of complex integrated circuits at the level available today is becoming a challenge as semiconductor integration continues at a fast pace. Fault avoidance techniques, mainly based on process technology & circuit design, will not be able to fully mitigate the impact of higher rates of occurrence of transient & intermittent faults. As a result, fault-tolerant features, specific to custom-designed components today, ought to be employed by COTS circuits in the future. Enhanced concurrent error detection & correction, self-checking circuits, space & time redundancy, triplication, and voting all need to be integrated into semiconductor devices in general, and microprocessors in particular, in order to improve fault & error handling.
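The three-level recovery order described in the abstract (SSRC, then SSRR, then RB) can be illustrated with a minimal sketch. This is only an illustration of the escalation order, not the processor's actual mechanism: the function names `recover` and `make_levels`, the boolean stand-in handlers, and the `"FAILURE"` label for the case where every level fails are all assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple

# A recovery level is a label plus a handler that reports success or failure.
# The labels come from the abstract; the handlers are hypothetical stand-ins.
RecoveryLevel = Tuple[str, Callable[[], bool]]


def recover(levels: Sequence[RecoveryLevel]) -> str:
    """Try each level in order and return the label of the first that succeeds."""
    for name, attempt in levels:
        if attempt():
            return name
    # Assumed label: the abstract does not name the outcome when all levels fail.
    return "FAILURE"


def make_levels(ssrc_ok: bool, ssrr_ok: bool, rb_ok: bool) -> List[RecoveryLevel]:
    """Build the escalation chain with fixed outcomes, for illustration only."""
    return [
        ("SSRC", lambda: ssrc_ok),  # level 1: recovery from a saved state
        ("SSRR", lambda: ssrr_ok),  # level 2: retry, used when level 1 fails
        ("RB", lambda: rb_ok),      # level 3: reset and reintegration under the OS
    ]


if __name__ == "__main__":
    # Level 1 fails, so recovery escalates to the retry level.
    print(recover(make_levels(False, True, True)))
```

Each level is attempted only when the previous one has failed, so the costliest action (processor reset and reintegration) is reserved for errors that the two cheaper levels cannot handle.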