Abstract
Development of fault-tolerant computing systems requires accurate reliability modeling. Analytic, simulation, and hybrid models are commonly used for obtaining reliability measures. These measures are functions of component failure rates and fault-coverage (probabilities). Coverage provides information about the fault and error detection, isolation, and system recovery capabilities. This parameter can be derived by physical or simulated fault injection. Statistical inference has been used to extract meaningful information from sample observation. The problem of conducting fault injection experiments and statistically inferring the coverage from the information gathered in those experiments is addressed in this paper. We perform statistical experiments in a multi-dimensional space of events. In this way all major factors which influence the coverage (fault locations, timing characteristics of the fault, and the workload) are accounted for. Multi-stage, stratified, and combined multi-stage and stratified sampling are used in this paper for deriving the coverage. Equations of the mean, variance, and confidence interval of the coverage are provided. The statistical error produced by the injected faults which do not induce errors in the tested system (also known as the nonresponse problem) is considered, A program which emulates a typical fault environment was developed and four hypothetical systems are analyzed.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have