Abstract

Multi-core and many-core processors are a promising solution to achieve high performance by maintaining a lower power consumption. However, the degree of miniaturization makes them more sensitive to soft-errors. To improve the system reliability, this work proposes a fault-tolerance approach based on redundancy and partitioning principles called N-Modular Redundancy and M-Partitions (NMR-MPar). By combining both principles, this approach allows multi-/many-core processors to perform critical functions in mixed-criticality systems. Benefiting from the capabilities of these devices, NMR-MPar creates different partitions that perform independent functions. For critical functions, it is proposed that N partitions with the same configuration participate of an N-modular redundancy system. In order to validate the approach, a case study is implemented on the KALRAY Multi-Purpose Processing Array (MPPA)-256 many-core processor running two parallel benchmark applications. The traveling salesman problem and matrix multiplication applications were selected to test different device’s resources. The effectiveness of NMR-MPar is assessed by software-implemented fault-injection. For evaluation purposes, it is considered that the system is intended to be used in avionics. Results show the improvement of the application reliability by two orders of magnitude when implementing NMR-MPar on the system. Finally, this work opens the possibility to use massive parallelism for dependable applications in embedded systems.

Highlights

  • The widespread use of COTS multi-core and many-core processors in most modern computing systems is a consequence of their flexibility, high performance, low-power consumption and intrinsic redundancy

  • It is important to note that the objective of this approach is to reproduce the Single Event Upset (SEU) effects, which are bit-flips in memory cells caused by natural radiation

  • Fault-injection campaigns for both applications are devoted to inject faults in the General Purpose Registers (GPRs) and 15 System Function Registers (SFRs) of the Processing Engines (PEs) belonging to the compute clusters

Read more

Summary

Introduction

The widespread use of COTS multi-core and many-core processors in most modern computing systems is a consequence of their flexibility, high performance, low-power consumption and intrinsic redundancy. Avionics and spacecraft industries are interested in validating the use of these components for their applications (e.g. international teams are working on this purpose: Certified Associate in Software Testing (CAST), European Aviation Safety Agency (EASA) and Reduced certification Costs for trusted Multi-core Platforms (RECOMP) projects). This allows a system to continue working properly in case one or more components fail. The TMR [33] is a well-known fault-tolerant technique that implements triplication of modules and majority voting This technique has been largely used to avoid errors in integrated circuits, but imposes very high hardware overhead. This method becomes more attractive when implemented on multi-/many-core processor architectures

Objectives
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call