Prediction-Based Error Correction for GPU Reliability with Low Overhead

Hyunyul Lim, Tae Hyun Kim, Sungho Kang

doi:10.3390/electronics9111849

Open Access

https://doi.org/10.3390/electronics9111849

Journal: Electronics	Publication Date: Nov 5, 2020
Citations: 1	License type: CC BY 4.0

Affiliation: Yonsei University

Abstract

Scientific and simulation applications are steadily gaining importance in many fields of research and industry. These applications require massive amounts of memory and substantial arithmetic computation. Therefore, general-purpose computing on graphics processing units (GPGPU), which combines the computing power of graphics processing units (GPUs) and general-purpose CPUs, has been used for computationally intensive scientific and big data processing applications. Because current GPU architectures lack hardware support for error detection in computation logic, GPGPU has low reliability. Unlike in graphics applications, errors in general-purpose computing applications can lead to serious problems. These applications are often intertwined with human life, meaning that errors can be life-threatening. Therefore, this paper proposes a novel method called Prediction-based Error Correction (PRECOR) for GPU reliability, which detects and corrects errors in GPGPU platforms with a focus on errors in computational elements. The proposed architecture needs a small number of checkpoint buffers to fix errors in computational logic. The PRECOR architecture has prediction buffers and controller units for predicting erroneous outputs before performing rollback. Following a rollback, the architecture confirms the accuracy of its predictions. The proposed method effectively reduces the hardware and time overheads required to correct errors. Experimental results confirm that PRECOR efficiently fixes errors with low hardware and time overheads.

Highlights

High-performance computing (HPC) applications typically require massive amounts of memory and a huge number of arithmetic computations.
Because big data processing applications have become increasingly intertwined with human life, the reliability of GPUs used for general-purpose applications, an approach referred to as general-purpose computing on graphics processing units (GPGPU), has become increasingly important.
The experimental results are presented as a comparison of the fault coverage of Prediction-based Error Correction (PRECOR).

Summary

Introduction

High-performance computing (HPC) applications typically require massive amounts of memory and a huge number of arithmetic computations. Special accelerators and processors have been proposed to achieve massive parallel computing power [1,2,3,4]. However, these accelerators and processors are very expensive to manufacture and cannot be used for general purposes. Graphics processing units (GPUs) contain a huge number of computation and memory units. Because of this highly parallel structure, recent GPU research has focused on general-purpose applications in the HPC field [5]. Because big data processing applications have become increasingly intertwined with human life, the reliability of GPUs used for general-purpose applications, an approach referred to as general-purpose computing on graphics processing units (GPGPU), has become increasingly important.
