Effective error detection is paramount for building highly dependable computing systems. A new methodology, based on physical and simulated fault injection, has been developed for assessing the effectiveness of error-detection mechanisms. The approach has two steps. (1) Transient faults are physically injected at the IC pin level of a prototype in order to derive the error-detection coverage. Experiments are carried out in a 3-dimensional space of events whose dimensions are the fault location, the time of occurrence, and the duration of the injected fault. (2) Simulated fault injection is performed to assess the effectiveness of new error-detection mechanisms designed to improve the detection coverage. Complex circuitry, based on checking for protocol violations, is considered. A temporal model of the protocol checker is used, and transient faults are injected into signal traces captured from the prototype system. These traces are used as inputs to the simulation engine. s-confidence intervals of the error-detection coverage are derived, both for the initial design and for the new detection mechanism. Physical fault injection, carried out on a prototype server, showed that several signals were sensitive to transient faults and that the error-detection coverage was unacceptably low. Simulated fault injection shows that an error-detection mechanism based on checking for protocol violations can appreciably increase the detection coverage, especially for transient faults longer than 200 nanoseconds. Additional research is required to improve the detection of shorter transients. The fault-injection experiments also show that error-detection coverage is a function of fault duration: the shorter the transient fault, the lower the coverage. As a consequence, injecting faults that have a unique, predefined duration, as was frequently done in the past, does not provide accurate information on the effectiveness of the error-detection mechanisms.
Injecting only permanent faults leads to unrealistically high estimates of the coverage. These experiments show that combined physical and simulated fault injection, performed in a 3-dimensional space of events, is a superior approach: it allows designers to accurately assess the efficacy of various candidate error-detection mechanisms without building expensive test circuits.
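As an illustration of the experimental setup described above, the following minimal sketch draws faults from a 3-dimensional space of events (location, time of occurrence, duration) and estimates the error-detection coverage separately for each fault duration, with a confidence interval. It is not the authors' tooling: the names `Fault`, `sample_faults`, `wilson_interval`, and `coverage_by_duration` are hypothetical, uniform sampling is an assumption, and the Wilson score interval is only one common choice for a binomial proportion; the abstract does not state how its s-confidence intervals were computed.

```python
import math
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """One point in the 3-dimensional space of events (hypothetical type)."""
    location: str       # name of the signal (pin) where the fault is injected
    time_ns: float      # time of occurrence within the observation window
    duration_ns: float  # duration of the transient


def sample_faults(signals, window_ns, durations_ns, n, rng):
    """Draw n faults uniformly from location x time x duration (an assumption)."""
    return [
        Fault(rng.choice(signals), rng.uniform(0.0, window_ns), rng.choice(durations_ns))
        for _ in range(n)
    ]


def wilson_interval(detected, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total == 0:
        raise ValueError("at least one experiment is required")
    p = detected / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def coverage_by_duration(outcomes):
    """Group (Fault, detected) pairs by duration and estimate coverage per group.

    `outcomes` should contain only injections that actually produced an error,
    since coverage is the fraction of errors that were detected.
    """
    counts = defaultdict(lambda: [0, 0])  # duration -> [detected, total]
    for fault, detected in outcomes:
        counts[fault.duration_ns][0] += int(detected)
        counts[fault.duration_ns][1] += 1
    return {
        d: (det / tot, wilson_interval(det, tot))
        for d, (det, tot) in sorted(counts.items())
    }
```

Reporting the coverage per duration, rather than as a single pooled figure, is what exposes the dependence on fault duration noted in the abstract. In use, `outcomes` would be filled with the results of the physical or simulated injections, and the returned mapping gives, for each duration, the point estimate and its interval.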