Abstract

Fault tolerance has become an essential concern for processor designers because of increasing transient fault rates, even for processors used in mainstream computing. Because the mainstream commodity market accepts only low-cost fault-tolerance solutions, traditional high-end solutions are unacceptable owing to their high overheads. This paper presents EnHTM, a low-cost hybrid software/hardware fault-tolerance solution for serial programs running on commodity systems. EnHTM employs a lightweight symptom-based mechanism to detect faults and recovers from them using a minimally modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. Compile-time analysis is also used to support larger transactions, so that transient faults detected only after a long latency can still be recovered. Evaluation results show that EnHTM recovers from 89.4% of catastrophic failures caused by transient faults, with an average performance overhead of 2.6% in error-free executions.
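As a rough illustration of the recovery idea described above, the following Python sketch models lazy data versioning in software: writes are buffered inside a transaction and reach memory only at commit, so a detected symptom can be handled by discarding the buffer and re-executing. This is a toy model under stated assumptions, not EnHTM's hardware design. The names `Symptom`, `LazyTransaction`, `run_with_recovery`, and `faulty_once`, as well as the retry limit, are hypothetical and do not come from the paper.

```python
class Symptom(Exception):
    """Hypothetical stand-in for a hardware-visible symptom (e.g. a fatal trap)."""


class LazyTransaction:
    """Toy model of lazy data versioning: writes stay in a buffer until commit."""

    def __init__(self, memory):
        self.memory = memory      # committed state, shared with the caller
        self.write_buffer = {}    # speculative writes, invisible until commit

    def read(self, addr):
        # A transaction sees its own speculative writes first.
        return self.write_buffer.get(addr, self.memory[addr])

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        # Publish all buffered writes to committed memory.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        # Discarding the buffer restores the pre-transaction state for free.
        self.write_buffer.clear()


def run_with_recovery(memory, body, max_retries=3):
    """Run body(tx, attempt) as a transaction; re-execute it if a symptom fires.

    Returns the index of the attempt that committed.
    """
    for attempt in range(max_retries + 1):
        tx = LazyTransaction(memory)
        try:
            body(tx, attempt)
        except Symptom:
            tx.abort()            # roll back: committed memory was never touched
            continue
        tx.commit()
        return attempt
    raise RuntimeError("symptom persisted across all retries")


def faulty_once(tx, attempt):
    """Increment x; on the first attempt, simulate a corrupting transient fault."""
    tx.write("x", tx.read("x") + 1)
    if attempt == 0:
        tx.write("x", 999)        # corrupted value, still only in the buffer
        raise Symptom("simulated transient fault")
```

Calling `run_with_recovery({"x": 1}, faulty_once)` aborts the first attempt and commits the second, so the corrupted value never reaches committed memory. The sketch also hints at why transaction size matters: a symptom that appears after the transaction has already committed can no longer be undone this way.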
