Abstract

Traditional fault-tolerance techniques that rely on spatial and temporal redundancy typically incur high power, delay, and area overheads. Cost-effective solutions often depend on the system’s design and the hardware platform at hand. For Field-Programmable Gate Arrays (FPGAs) in particular, soft errors in the configuration memory are a significant dependability threat. In this work, we present an extended and comprehensive fault-tolerance mechanism especially suited to handling configuration faults in FPGA-based systems that must deal with multiple failure modes. Each failure mode may have a different criticality and probability of occurrence, and these properties are measured and exploited to provide low-cost solutions compared with standard approaches such as triple modular redundancy. The exploited properties are typically found in critical monitoring systems that may trigger security- or safety-critical alarms and warnings. In such systems, failing to trigger an alarm when necessary is frequently regarded as more critical than raising an occasional false alarm. For instance, Regular Expression Matching (REM), a compute-intensive mechanism heavily used to perform Deep Packet Inspection in critical network applications, exhibits such properties, and it can be greatly accelerated by FPGAs to meet performance constraints in high-throughput networks. Therefore, we use FPGA-based REM engines as a case study to demonstrate the effectiveness of the proposed techniques. Additionally, a mutually aware placement and scrubbing mechanism is introduced to reduce the repair time, improving system reliability and availability. Experimental results show that the failure rate and the repair time can be reduced by 95% and 90%, respectively, while avoiding the costs of triplication.
