SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters

Diego Montezanti, Marcelo Naiouf, Emilio Luque, Armando De Giusti, Fernando Emmanuel Frati, Dolores Rexachs

doi:10.19153/cleiej.15.3.5

Open Access

https://doi.org/10.19153/cleiej.15.3.5

Abstract

Performance improvements in current processors are achieved by increasing the integration scale. This brings a growing vulnerability to transient faults, whose impact increases on multicore clusters running large scientific parallel applications. The requirement to enhance the reliability of these systems, coupled with the high cost of rerunning the application from the beginning, motivates specific software strategies for the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications by validating the contents of the messages to be sent, preventing the transmission of errors to other processes, and leveraging the intrinsic hardware redundancy of the multicore. SMCV achieves broad robustness against transient faults with reduced overhead, and strikes a trade-off between moderate detection latency and low additional workload.
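The validation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `validated_send`, `TransientFaultDetected` and the `inject_fault` flag are hypothetical, a `queue.Queue` stands in for the message-passing channel, and two Python threads stand in for replicas running on separate cores. The sketch assumes the message is computed twice and is sent only when both copies match.

```python
import queue
import threading


class TransientFaultDetected(RuntimeError):
    """Raised when the two replicas disagree on the message contents."""


def validated_send(compute, channel, inject_fault=False):
    """Compute a message in two replica threads; send it only if both agree.

    compute      -- callable returning the message as bytes (assumed deterministic)
    channel      -- queue.Queue standing in for the communication layer (assumption)
    inject_fault -- hypothetical test hook that flips one bit in one replica
    """
    results = [None, None]

    def replica(idx):
        value = compute()
        if inject_fault and idx == 1:
            # Simulate a transient fault: flip the lowest bit of the first byte.
            value = bytes([value[0] ^ 0x01]) + value[1:]
        results[idx] = value

    threads = [threading.Thread(target=replica, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Validate the contents before sending, so a corrupted message
    # never reaches another process.
    if results[0] != results[1]:
        raise TransientFaultDetected("message contents differ between replicas")
    channel.put(results[0])
```

Called with a deterministic `compute`, the function places one message on the channel. With `inject_fault=True` it raises before anything is sent, which mirrors the idea of keeping an error from propagating to other processes. Comparing only at send time is what trades some detection latency for a lower additional workload.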

Highlights

Improvements in the computational performance of current processors have been achieved by increasing the integration scale, which means that the number of transistors per chip keeps growing
SMCV (Sent Message Content Validation) is presented, which is a proposal designed for the detection of transient faults in scientific, message-passing parallel applications that execute on the nodes
SMCV is a distributed strategy that improves the reliability of the system, isolating the error produced in the context of an application process and preventing it from propagating to the others

Summary

Introduction

Improvements in the computational performance of current processors have been achieved by increasing the integration scale, which means that the number of transistors per chip keeps growing. The impact of faults becomes more significant for long-running applications, given the high cost of relaunching execution from the beginning. These factors justify the need for a set of strategies to improve the reliability of high-performance computing systems. Hardware-based techniques [8,9,11,13] aim to protect the various elements of the processor by adding extra logic to provide redundancy. These are most widely used in critical environments, such as flight systems or high-availability servers, where the consequences of a transient fault can be disastrous. There are numerous detection proposals, based on duplication and designed for serial programs, whose purpose is to ensure execution reliability. From this standpoint, a parallel application can be viewed as a set of sequential processes that have to be protected from the consequences of transient faults by means of the adopted techniques. This paper presents SMCV (Sent Message Content Validation), a proposal designed for the detection of transient faults in scientific, message-passing parallel applications that execute on the nodes

1 ECC: Error Correcting Code
2 HPC: High Performance Computing
3 DMR

Background

Transient Faults in Message Passing Parallel Applications

Related Work

Validating Contents of Sent Messages

Leveraging Redundant Hardware Resources

Proposed Methodology Description

Characterizing SMCV’s Additional Workload

Testing SMCV’s Effectiveness

Overhead Measurements

Future Work

Findings

Conclusions

Journal: CLEI Electronic Journal	Publication Date: Dec 1, 2012
Citations: 22	License type: CC BY 4.0
