Effects Of Soft Errors Research Articles

The latest advances in deep learning models are unprecedented. A wide spectrum of application areas is now thriving thanks to the availability of massive training datasets and very large, complex neural network models. Those two characteristics demand computing power that only advanced computing platforms can provide. Therefore, distributed deep learning has become a necessity for capitalizing on the potential of cutting-edge artificial intelligence. Two basic schemes have emerged in distributed learning. The first is the data-parallel approach, which divides the training dataset across multiple computing nodes. The second is the model-parallel approach, which splits the layers of a model across several computing nodes. Each scheme has its upsides and downsides, particularly when running on large machines that are susceptible to soft errors. Those errors occur as a consequence of several factors involved in the manufacturing process of the electronic components of current supercomputers. On many occasions, those errors manifest as bit flips that do not cause the whole system to crash but produce wrong numerical results in computations. To study the effect of soft errors on different approaches to distributed learning, we leverage checkpoint alteration, a technique that injects bit flips into checkpoint files. It allows researchers to understand the effect of soft errors on applications that produce checkpoint files in HDF5 format. This paper uses the popular deep learning framework PyTorch on two distributed-learning platforms: one for data-parallel training and one for model-parallel training. We use well-known deep learning models with popular training datasets to provide a picture of how soft errors challenge the training phase of a deep learning model.
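As a rough illustration of what a single bit flip does to a stored value, the sketch below flips one bit in the IEEE 754 single-precision encoding of a number. It is not the checkpoint-alteration tool described in the abstract: the function name `flip_bit_float32` is hypothetical, and the sketch assumes the affected value is a 32-bit float such as a model weight, without touching any HDF5 file.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant, 31 = sign) of value's float32 encoding.

    Hypothetical helper for illustration only; it models a soft error in a
    single stored weight, not the paper's checkpoint-alteration procedure.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask flips exactly that bit.
    corrupted = as_int ^ (1 << bit)
    # Reinterpret the corrupted bit pattern as a float32 again.
    (result,) = struct.unpack("<f", struct.pack("<I", corrupted))
    return result


if __name__ == "__main__":
    weight = 1.0
    for bit in (0, 23, 30, 31):
        print(f"bit {bit:2d}: {weight} -> {flip_bit_float32(weight, bit)}")
```

The position of the flipped bit matters a great deal: a flip in the low mantissa bits changes the value only marginally, while a flip in the exponent or sign bits can change its magnitude or sign entirely.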

Extensive research efforts are being carried out to evaluate and improve the reliability of computing devices, either through beam experiments or through simulation-based fault injection. Unfortunately, it is still largely unclear to what extent fault injection can provide an accurate error rate estimate at early stages and whether beam experiments can be used to identify the weakest resources in a device. The importance and challenges of a timely yet realistic reliability evaluation grow with increasing complexity in both the hardware domain, with the integration of different types of cores in an SoC (System-on-Chip), and the software domain, with the OS (operating system) required to take full advantage of the available resources. In this paper, we combine and analyze data gathered with extensive beam experiments (on the final physical CPU hardware) and microarchitectural fault injections (on early microarchitectural CPU models). We target a standalone Arm Cortex-A5 CPU and an Arm Cortex-A9 CPU integrated into an SoC and evaluate their reliability in bare-metal and Linux-based configurations. Combining experimental data that covers more than 18 million years of device time with the results of more than 176,000 injections, we find that both the SoC integration and the presence of the OS increase the system DUE (Detected Unrecoverable Error) rate (for different reasons) but do not significantly impact the SDC (Silent Data Corruption) rate, which is solely attributed to the CPU core. Our reliability analysis demonstrates that, even considering SoC integration and OS inclusion, early, pre-silicon microarchitecture-level fault injection delivers accurate SDC rate estimates and lower bounds for the DUE rates.

Articles published on Effects Of Soft Errors

A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning

Learning-Based Mitigation of Soft Error Effects on Quaternion Kalman Filter Processing

Investigation of edge computing hardware architectures processing tiny machine learning under neutron-induced radiation effects

Soft-Error-Aware Radiation-Hardened Ge-DLTFET-Based SRAM Cell Design

Soft Error Effects on Arm Microprocessors: Early Estimations versus Chip Measurements

Emulating the Effects of Radiation-Induced Soft-Errors for the Reliability Assessment of Neural Networks

Impact of negative bias temperature instability on single event transients in scaled logic circuits

Soft Error Tolerant Count Min Sketches

A comprehensive analysis on the resilience of adiabatic logic families against transient faults

The Effects of Soft Errors and Mitigation Strategies for Virtualization Servers

Reliable and high performance asymmetric FinFET SRAM cell using back-gate control

32-Bit One Instruction Core: A Low-Cost, Reliable, and Fault-Tolerant Core for Multicore Systems

An ALU Protection Methodology for Soft Processors on SRAM-Based FPGAs

Exploiting Hardware Unobservability for Low-Power Design and Safety Analysis in Formal Verification-Driven Design Flows

Effect of Soft Errors in Iterative Learning Control and Compensation using Cross-layer Approach

Soft Error in Saddle Fin Based DRAM

Half-select free bit-line sharing 12T SRAM with double-adjacent bits soft error correction and a reconfigurable FPGA for low-power applications

Dependability Analysis of Data Storage Systems in Presence of Soft Errors

Correction to “Impact of Microarchitectural Differences of RISC-V Processor Cores on Soft Error Effects”

Design and Implementation of Configuration Memory SEU-Tolerant Viterbi Decoders in SRAM-Based FPGAs
