Radiation-Tolerant Deep Learning Processor Unit (DPU)-Based Platform Using Xilinx 20-nm Kintex UltraScale FPGA

Pierre Maillard, Minal Sawant, Nicholas Fraser, Yanran P Chen, Martin L Voogel, Jason Vidmar, Giulio Gambardella

doi:10.1109/tns.2022.3216360

Abstract

This paper presents a platform and design approach for enabling radiation-tolerant deep learning acceleration on SRAM-based 20-nm Kintex UltraScale™ FPGAs, for terrestrial and high-radiation environments. The platform is suitable for deep neural network (DNN) implementations, with an emphasis on image classification, and includes solutions to mitigate both radiation-induced Single Event Functional Interrupts (SEFIs) and network datapath corruptions. To mitigate SEFIs, it combines Xilinx’s Deep Learning Processor Unit (DPU) IP, the Triple Modular Redundancy (TMR) MicroBlaze soft processor IP, and the Soft Error Mitigation (SEM) IP. In addition, a technique known as Fault Aware Training (FAT) was applied to mitigate single event effects in the datapath. Test results are presented from a high-energy proton beam (>60 MeV) experiment using the ResNet-18 Convolutional Neural Network (CNN) for image classification. The Single Event Upset (SEU) rate, system-level SEFI rate, and neural network classification/datapath performance are compared between the radiation-tolerant platform and a standard, non-mitigated approach. Results show that datapath classification errors dominate the system response (90%), with SEFIs accounting for the remaining 10%. Compared to standard non-mitigated training techniques, the radiation-tolerant platform using fault aware training shows substantial improvements in overall system response: the overall single event cross-section was reduced by half, and a 40% reduction in misclassification errors was observed. In addition, datapath events with classification accuracy degradation larger than 5% were completely mitigated. The SEFI rate was reduced by a factor of 100 with the implemented solutions and can be further reduced by optimizing the physical separation between TMR modules.

Full Text