Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery

Server Kasap, Eduardo Weber Wächter, Xiaojun Zhai, Shoaib Ehsan, Klaus D Mcdonald-Maier

doi:10.1016/j.microrel.2021.114297

Abstract

All-Programmable System-on-Chips (APSoCs) constitute a compelling option for deploying applications in radiation environments thanks to their high-performance computing and power efficiency merits. Despite these advantages, APSoCs are sensitive to radiation like any other electronic device. Processors embedded in APSoCs, therefore, have to be adequately hardened against ionizing radiation to make them a viable design choice for harsh environments. This paper proposes a novel lockstep-based approach to harden the dual-core ARM Cortex-A9 processor in the Xilinx Zynq-7000 APSoC against radiation-induced soft errors by coupling it with a MicroBlaze TMR subsystem in the programmable logic (PL) layer of the Zynq. The proposed technique uses checkpointing along with roll-back and roll-forward mechanisms at the software level (i.e. software redundancy), as well as processor replication and checker circuits at the hardware level (i.e. hardware redundancy). Results of fault injection experiments show that the proposed approach achieves high levels of protection against soft errors by mitigating around 98% of bit-flips injected into the register files of both ARM cores while keeping the timing performance overhead as low as 25% if block and application sizes are adjusted appropriately. Furthermore, incorporating the roll-forward recovery operation in addition to the roll-back operation improves the Mean Workload between Failures (MWBF) of the system by up to ≈19% depending on the nature of the running application: when a fault occurs, the application can proceed faster if it is treated with the roll-forward operation rather than the roll-back operation. Thus, relatively more data can be processed before the next error occurs in the system.
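
The difference between the two recovery operations can be illustrated with a minimal software sketch. This is not the authors' implementation: it assumes three replicas that each run the same code block from a shared checkpoint, a majority vote on the resulting state, roll-forward when two replicas agree (the agreed state becomes the new checkpoint, so no re-execution is needed), and roll-back when no two replicas agree (the previous checkpoint is kept and the block must be re-run). The names `run_block` and `lockstep_step`, and the dictionary standing in for the register file, are hypothetical.

```python
from collections import Counter
from copy import deepcopy


def run_block(state, block, fault=None):
    """Run one code block on a private copy of the state.

    `fault` is an optional (key, bit) pair that flips one bit in the
    result, standing in for an injected soft error.
    """
    new_state = block(deepcopy(state))
    if fault is not None:
        key, bit = fault
        new_state[key] ^= 1 << bit  # simulated bit-flip
    return new_state


def lockstep_step(checkpoint, block, faults=(None, None, None)):
    """Execute one block on three replicas and pick a recovery action.

    Returns (new_checkpoint, action), where action is one of
    'ok', 'roll-forward' or 'roll-back'.
    """
    results = [run_block(checkpoint, block, f) for f in faults]
    # Make each replica's state hashable so the states can be voted on.
    votes = Counter(tuple(sorted(r.items())) for r in results)
    winner, count = votes.most_common(1)[0]
    if count == 3:
        return dict(winner), "ok"            # all replicas agree
    if count == 2:
        return dict(winner), "roll-forward"  # adopt the majority state
    return checkpoint, "roll-back"           # no majority: keep old checkpoint
```

In this sketch, a single faulty replica costs only a state copy, whereas a roll-back costs a full re-execution of the block, which is the intuition behind the MWBF gain reported in the abstract.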

Highlights

Cleaning up the legacy of nuclear waste is one of Europe's most critical and complicated environmental remediation projects, which is expected to cost as much as £220bn over the next 120 years [1].
Another conclusion can be drawn from Table 7: as more code execute blocks are employed within an application, that is to say, as the application size grows, the timing performance overheads of the triple-core lockstep (TCLS) design become more favourable for the given matrix size.
Although the time overheads associated with the TCLS design may not suit some hard real-time systems, they would be tolerable for many systems requiring high reliability and dependability under harsh environments once block and application sizes are appropriately adjusted through trial and error based on the nature of the given application program.

Summary

Introduction

Cleaning up the legacy of nuclear waste is one of Europe's most critical and complicated environmental remediation projects, which is expected to cost as much as £220bn over the next 120 years [1]. Many mission-critical applications could have been implemented in All-Programmable Systems-on-Chips (APSoCs), which combine a programmable logic (PL) layer (i.e. an SRAM-based FPGA layer) with embedded processors in the processor subsystem (PS) layer. Such APSoCs enjoy the merits of higher performance, lower energy consumption, and favourable time-to-market and cost [5]. These highly integrated circuits, which involve a set of homogeneous or heterogeneous processor cores, are very susceptible to transient faults that might even lead to total system failures. Experiments indicate that the TCLS approach applied to the dual-core ARM Cortex-A9 processor embedded in the Xilinx Zynq-7000 APSoC is capable of mitigating around 98% of the injected bit-flips while keeping the timing performance overhead as low as 25%, when certain conditions are satisfied, under fault-free conditions.

Background

Radiation effects on electronics

Effects of soft errors in processors

Fault-tolerance techniques

Lockstep technique

Proposed triple-core lockstep technique

Architecture

Methodology

Interrupt implementation

Consistency check and checkpoint implementations

Fault injection technique

Implementation and experimental results

Resource consumption analysis

Timing performance analysis for matrix multiplication benchmarks

Fault-injection performance analysis for matrix-multiplication benchmarks

Fault-injection performance analysis for 256-bit AES encryption benchmarks

Findings

Conclusion

Journal: Microelectronics and Reliability
Publication Date: Aug 5, 2021
Citations: 10
License type: CC BY
