Abstract

The current trend in commercial processors toward multi-core architectures poses both an opportunity and a challenge for future space-based processing. The opportunity is to leverage multi-core processors for compute-intensive applications and thus provide an order-of-magnitude increase in onboard processing capability with less size, mass, and power. The challenge is to provide the requisite safety and reliability in an extremely harsh radiation environment. The objective is to advance from the multiple single-processor systems typically flown to a fault-tolerant multi-core system. We investigate software-based methods for multi-core processor fault tolerance to single event effects (SEEs) that cause interrupts or ‘bit-flips’, and we propose to use additional cores and memory resources together with newly developed software protection techniques. This work also assesses the optimal trade-off between reliability and performance. Our work is based on the modern compiler framework LLVM, chosen because it has been ported to many architectures. We implement optimization passes that automatically add protection techniques, including N-modular redundancy (NMR) and error detection and correction (EDAC), at the assembly/instruction level for the languages LLVM supports. The passes modify the intermediate representation of the source code, so they can be applied to any high-level language and any processor architecture supported by the LLVM framework. In our initial experiments, we separately implement triple modular redundancy (TMR) and error detection and correction codes (Hamming and BCH) at the instruction level. For critical applications we combine the two methods: we first apply TMR to the instructions and then use EDAC as a further measure when TMR cannot correct the errors originating from an SEE.
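To make the TMR step concrete, the following is a minimal sketch of bitwise majority voting over three redundant copies of a value. It is written in Python for illustration only and is not the LLVM pass described above; the function names `tmr_vote` and `tmr_disagreement` are hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant copies.

    Each output bit takes the value held by at least two of the copies,
    so a bit-flip confined to a single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


def tmr_disagreement(a: int, b: int, c: int) -> int:
    """Return a mask of bit positions where the copies are not unanimous."""
    return (a ^ b) | (a ^ c)


if __name__ == "__main__":
    good = 0b1010
    flipped = good ^ 0b0100          # simulate one bit-flip in the third copy
    voted = tmr_vote(good, good, flipped)
    print(bin(voted), bin(tmr_disagreement(good, good, flipped)))
```

If two copies are corrupted in the same bit position, the vote returns the wrong value. That is the kind of case in which a second layer such as EDAC is needed.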
Our initial experiments show good performance (about 10% overhead) when protecting the memory of code using a single-error-correction, double-error-detection (SEC-DED) Hamming code and TMR. Further work is needed to improve performance when protecting the memory of code using the BCH code. This work would be highly valuable not only for satellites and other space systems but also for general computing, such as aircraft, automotive systems, server farms, and medical equipment (or anywhere that requires safety-critical performance), as hardware gets smaller and more susceptible to such errors.
