Abstract

Radiation-induced soft errors in electronic computing systems can either affect non-essential system functionality or violate safety-critical conditions, potentially leading to life-threatening situations. To meet high safety standards, reliability engineers must be able to explore and identify efficient mitigation solutions that reduce the occurrence of soft errors early in the design cycle. This paper presents SOFIA, a framework that integrates: (i) a set of fault injection techniques that enable bespoke inspections, (ii) machine learning methods to correlate soft error results with system architecture parameters, and (iii) mitigation techniques, including full and partial triple modular redundancy (TMR) as well as a register allocation technique (RAT), which allocates critical code (e.g., an application’s function or a machine learning layer) to a pool of specific processor registers. The proposed framework and novel variations of the RAT are validated through more than 1739k fault injections considering a real Linux kernel, benchmarks from different domains, and a multi-core Arm processor.
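
As a rough illustration of the TMR idea mentioned above, the sketch below triplicates a value, flips one bit in one replica to mimic a soft error, and recovers the original through a bitwise majority vote. It is not SOFIA's implementation: the function names (`flip_bit`, `tmr_vote`, `run_trial`) and the 32-bit word width are assumptions made for this example only.

```python
import random


def flip_bit(value: int, bit: int) -> int:
    """Model a single-bit upset by inverting one bit of the value."""
    return value ^ (1 << bit)


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def run_trial(golden: int, width: int, rng: random.Random) -> bool:
    """Inject one bit flip into one of three replicas and check the vote."""
    replicas = [golden, golden, golden]
    victim = rng.randrange(3)  # which replica is hit (hypothetical fault model)
    replicas[victim] = flip_bit(replicas[victim], rng.randrange(width))
    return tmr_vote(*replicas) == golden


if __name__ == "__main__":
    rng = random.Random(0)
    masked = sum(run_trial(0xDEADBEEF, 32, rng) for _ in range(1000))
    print(f"masked {masked} of 1000 injected single-bit faults")
```

A single upset in one replica is always outvoted by the other two, so every trial in this simplified model is masked. Faults that hit two replicas at the same bit position, or the voter itself, are outside what this sketch covers.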
