Abstract
Low-power, high-performance System-on-Chip (SoC) devices, such as the NVIDIA Tegra K1 and Tegra X1, have many potential uses in aerospace applications. Fusing ARM CPUs with a large GPU, Tegra SoCs are well suited to image and signal processing. However, fault masking and tolerance on GPUs are relatively unexplored for harsh environments. With hundreds of GPU cores, a complex caching structure, and a custom task scheduler, Tegra SoCs are vulnerable to a wide range of single-event upsets (SEUs). Triple-modular redundancy (TMR) provides a strong basis for fault masking on a wide range of devices, but GPUs pose a unique challenge to a typical TMR implementation. NVIDIA's scheduler assigns tasks based on available resources, but the scheduling process is not publicly documented. As a result, a malfunctioning core could be assigned the same block of code in each TMR module; in that case, a fault could go undetected and corrupt the resulting data. Likewise, an upset in the scheduler or cache could compromise data integrity. To mask and mitigate upsets in GPUs, we propose and investigate a new method that combines persistent threading and CUDA Streams with TMR. Persistent threading is a GPU programming approach in which a kernel's threads run indefinitely, and CUDA Streams enable multiple kernels to run concurrently on a single GPU. By combining these two paradigms, we remove the vulnerability to scheduler faults and ensure that each iteration executes concurrently on different cores, with each instance holding its own copy of the data. We evaluate our method with an experiment that applies a Sobel filter to a 640×480 image on an NVIDIA Tegra X1. To inject faults and verify our method, a separate task corrupts a memory location; this simple injector allows us to simulate an upset in a GPU core or memory location.
From this experiment, our results confirm that combining persistent threading and CUDA Streams with TMR masks the simulated SEUs on the Tegra X1. Furthermore, we provide performance results quantifying the overhead of this new method.