Abstract


 
 
 Processor performance has been improved largely by increasing the integration scale. This brings a growing vulnerability to transient faults, whose impact increases on multicore clusters running large scientific parallel applications. The need to enhance the reliability of these systems, coupled with the high cost of rerunning an application from the beginning, motivates specific software strategies for the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications by validating the contents of messages before they are sent, which prevents errors from being transmitted to other processes, and by leveraging the intrinsic hardware redundancy of the multicore. SMCV achieves wide robustness against transient faults with reduced overhead, trading moderate detection latency for low additional workload.
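
The validation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each process computes its outgoing message twice, in two replicas, and releases the message only when both copies match. The names `validated_send`, `flip_first_bit`, `TransientFaultDetected`, and the `inject_fault` flag are hypothetical, and Python threads merely stand in for replicas that would run on separate cores of a node.

```python
from concurrent.futures import ThreadPoolExecutor


class TransientFaultDetected(RuntimeError):
    """Raised when the two replicas disagree on an outgoing message."""


def flip_first_bit(data: bytes) -> bytes:
    """Hypothetical fault injector: flip one bit to mimic a transient fault."""
    if not data:
        return data
    corrupted = bytearray(data)
    corrupted[0] ^= 0x01
    return bytes(corrupted)


def validated_send(compute, send, inject_fault=False):
    """Compute the message twice, compare both copies, send only on a match."""
    # Two replicas of the same computation. On a real cluster each replica
    # would be a thread bound to its own core (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(compute)
        replica = pool.submit(compute)
        message = primary.result()
        copy = replica.result()

    if inject_fault:
        # Simulate a transient fault that corrupted the replica's result.
        copy = flip_first_bit(copy)

    if message != copy:
        # Withhold the message so the error cannot reach other processes.
        raise TransientFaultDetected("replicas disagree; message withheld")

    send(message)
    return message
```

Calling `validated_send(lambda: bytes([1, 2, 3]), outbox.append)` with an empty list `outbox` appends the message to it, while the same call with `inject_fault=True` raises `TransientFaultDetected` and sends nothing. Comparing only at send time is what keeps the added workload low: a corrupted value is detected with some latency, but before it can leave the process.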
 
 

Highlights

  • The computing performance of current processors has been improved by increasing the integration scale, which means that the number of transistors per chip keeps growing

  • SMCV (Sent Message Content Validation) is presented: a proposal for detecting transient faults in scientific, message-passing parallel applications that execute on the nodes

  • SMCV is a distributed strategy that improves the reliability of the system, isolating the error produced in the context of an application process and preventing it from propagating to the others

Summary

Introduction

The computing performance of current processors has been improved by increasing the integration scale, which means that the number of transistors per chip keeps growing. The impact of faults becomes more significant for longer-running applications, given the high cost of relaunching execution from the beginning. These factors justify the need for a set of strategies to improve the reliability of high-performance computing systems. Hardware-based techniques [8,9,11,13] aim to protect the various elements of the processor by adding extra logic to provide redundancy. These are most widely used in critical environments, such as flight systems or high-availability servers, where the consequences of a transient fault can be disastrous. There are numerous duplication-based detection proposals designed for serial programs, whose purpose is to ensure reliable execution. From this standpoint, a parallel application can be viewed as a set of sequential processes that have to be protected from the consequences of transient faults by the adopted techniques. SMCV (Sent Message Content Validation) is presented: a proposal for detecting transient faults in scientific, message-passing parallel applications that execute on the nodes

ECC: Error Correcting Code; HPC: High Performance Computing; DMR
Background
Transient Faults in Message Passing Parallel Applications
Related Work
Validating Contents of Sent Messages
Leveraging Redundant Hardware Resources
Proposed Methodology Description
Characterizing SMCV’s Additional Workload
Testing SMCV’s Effectiveness
Overhead Measurements
Future Work
Findings
Conclusions
