Abstract

For matrix operations, algorithm-based fault tolerance (ABFT) incurs much lower overhead than traditional Triple Modular Redundancy or Double Modular Redundancy approaches. Much work has been done to develop and optimize ABFT schemes on general-purpose microprocessors. However, ABFT schemes for heterogeneous systems with GPUs are not yet fully developed or optimized. Moreover, existing ABFT schemes can correct computing errors introduced by logic units, but many memory storage errors cannot be detected or corrected by them. In this work, we design a new ABFT scheme that protects both computation and memory storage, and we apply it to Cholesky decomposition on heterogeneous systems with GPUs. In addition, we develop several techniques for reducing fault tolerance overhead specifically on heterogeneous systems with GPU accelerators. Experimental results show that our ABFT scheme is able to correct both computing errors and memory storage errors with low overhead and comparable overall performance.
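
As a rough illustration of the checksum idea that ABFT schemes in general build on, the sketch below protects a stored matrix with two checksums per column (a plain sum and a row-weighted sum) and uses them to locate and repair a single corrupted element. This is the classic textbook construction, not the scheme proposed in this work; the function names `encode_checksums` and `detect_and_correct` are hypothetical, and the sketch assumes at most one corrupted element per column and that the checksums themselves are intact.

```python
def encode_checksums(matrix):
    """Return (plain, weighted) column checksums for a list-of-rows matrix.

    plain[j]    = sum_i matrix[i][j]
    weighted[j] = sum_i (i + 1) * matrix[i][j]
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    plain = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    weighted = [
        sum((i + 1) * matrix[i][j] for i in range(n_rows)) for j in range(n_cols)
    ]
    return plain, weighted


def detect_and_correct(matrix, plain, weighted, tol=1e-9):
    """Repair in place at most one corrupted element per column.

    Assumption (illustrative only): a column holds at most one bad element
    and the stored checksums are correct. Returns the list of (row, col)
    positions that were repaired.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    repaired = []
    for j in range(n_cols):
        # Difference between the recomputed and the stored plain checksum
        # equals the error magnitude when one element is corrupted.
        d_plain = sum(matrix[i][j] for i in range(n_rows)) - plain[j]
        if abs(d_plain) <= tol:
            continue  # column is consistent with its checksum
        d_weighted = (
            sum((i + 1) * matrix[i][j] for i in range(n_rows)) - weighted[j]
        )
        # The ratio of the two differences is the 1-based row index.
        row = round(d_weighted / d_plain) - 1
        if not 0 <= row < n_rows:
            raise ValueError(f"column {j}: error is not a single-element fault")
        matrix[row][j] -= d_plain
        repaired.append((row, j))
    return repaired
```

In a decomposition such as Cholesky, checksums of this kind would additionally have to be kept consistent as the matrix is updated; the sketch only covers the storage-protection step on a static matrix.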
