Fault tolerance at system level based on RADIC architecture

Marcela Castro-León, Hugo Meyer, Dolores Rexachs, Emilio Luque

doi:10.1016/j.jpdc.2015.08.005

Abstract

The increasing failure rate in High Performance Computing (HPC) encourages the investigation of fault tolerance mechanisms that guarantee the execution of an application in spite of node faults. This paper presents an automatic and scalable fault-tolerant model designed to be transparent to applications and to message passing libraries. The model detects failures in the communication socket caused by a faulty node. In those cases, the affected processes are recovered on a healthy node and the connections are reestablished without losing data. The Redundant Array of Distributed Independent Controllers (RADIC) architecture proposes a decentralized model for all the tasks required in a fault tolerance system: protection, detection, recovery and masking. Decentralized algorithms allow the application to scale, which is a key property for current HPC systems. Three different rollback recovery protocols are defined and discussed with the aim of offering alternatives that reduce overhead when multicore systems are used. A prototype has been implemented to carry out an exhaustive experimental evaluation with the Master/Worker and Single Program Multiple Data execution models. Multiple workloads and an increasing number of processes have been taken into account to compare the above-mentioned protocols. The executions take place on two multicore Linux clusters with different socket communication libraries.
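The detect, recover and mask sequence described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation or any of its three rollback recovery protocols: `MaskedChannel`, the `connect` callback and the retry limit are hypothetical names introduced here, and the sketch assumes that some other component has already restarted the failed process on a healthy node, so that `connect` returns a working connection to it.

```python
from typing import Callable, Protocol


class Connection(Protocol):
    """Anything with a socket-like sendall(), e.g. socket.socket."""

    def sendall(self, data: bytes) -> None: ...


class MaskedChannel:
    """Hypothetical wrapper that hides a peer failure from the caller."""

    def __init__(self, connect: Callable[[], Connection], max_retries: int = 3):
        # `connect` is an assumed callback that returns a connection to
        # wherever the peer process currently runs (original or recovery node).
        self._connect = connect
        self._max_retries = max_retries
        self._conn = connect()
        self.recoveries = 0  # number of times the connection was reestablished

    def send(self, data: bytes) -> None:
        for _ in range(self._max_retries + 1):
            try:
                self._conn.sendall(data)
                return
            except OSError:
                # Detection: the socket error signals a faulty peer node.
                # Masking: reconnect and resend, so the caller sees no error
                # and the message is not lost.
                self._conn = self._connect()
                self.recoveries += 1
        raise ConnectionError("peer unreachable after retries")
```

The caller only ever invokes `send`; the socket error is caught inside the wrapper, which is the sense in which the failure is transparent to the application in this sketch.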

Journal: Journal of Parallel and Distributed Computing	Publication Date: Aug 28, 2015
Citations: 41	License type: cc-by-nc-nd

Similar Papers

FRASystem: fault tolerant system using agents in distributed computing systems
Hwamin Lee ... Giyeol Lee
Cluster Computing | VOL. 14
17 Jul 2009

Fault tolerance in distributed power systems
R.V White
16 Oct 1995

Towards a state synchronization methodology for recovery process after partial reconfiguration of fault tolerant systems
Karel Szurman ... Lukas Miculka
01 Dec 2014

Practical considerations in the design of power system architectures for fault tolerant systems
R.E Johnson
14 Oct 2001
