Power-Aware Checkpointing for Multicore Embedded Systems

Mohsen Ansari,Heba Khdr,Jorg Henkel,Shaahin Hessabi,Sepideh Safari,Alireza Ejlali,Pourya Gohari-Nazari

doi:10.1109/tpds.2022.3188568

Abstract

Increasing the number of cores integrated on a single chip offers a great potential for the implementation of fault-tolerant techniques to achieve high reliability in real-time embedded systems. Checkpointing with rollback-recovery is a well-established technique to tolerate transient faults in multicore platforms. To consider the worst-case fault occurrence scenario, checkpointing technique requires to re-execute some parts of the tasks, and that might lead to simultaneous execution of task parts with high power consumptions, which eventually might result in a peak power increase beyond the thermal design power (TDP). Exceeding TDP can elevate on-chip temperatures beyond safe limits, and thereby triggering countermeasures that throttle down the voltage and frequency levels or power gate the cores. Such countermeasures might lead to violating task deadlines and degrading the system's reliability. To avoid such severe scenarios, it is inevitable to consider the impact of applying fault-tolerant techniques on the power consumption and prevent violating the power constraint of the chip, i.e., TDP. This paper presents for the first time, a peak-power-aware checkpointing (PPAC) technique that tolerates a given number of faults, <i>k</i>, while at the same time meets the power constraint in hard real-time embedded systems. To do this, our proposed technique (PPAC) adjusts the timing of the checkpoints, which have lower power consumption than the tasks to the execution time points that have power spikes beyond TDP. Moreover, PPAC exploits the available slack times on the cores to delay the execution of some tasks to avoid the remaining power spikes beyond TDP, which could not be mitigated by solely adjusting checkpoints. To evaluate our technique, we extend the state-of-the-art system-level simulator, gem5, with the state-of-the-art checkpointing module in Linux. Our experimental results show that our proposed technique is able to tolerate a given number of faults without exceeding the timing and power constraints in hard real-time embedded systems. The resulting peak power reduction achieved by our technique compared to state-of-the-art techniques is an average of 23%. Moreover, our technique employs the Dynamic Power Management (DPM) during the slack times resulting at runtime in the case of fault-free scenarios, which provides energy savings with an average of 17.28% and up to 61.1%.

Full Text