Reverse computation for rollback-based fault tolerance in large parallel systems

Kalyan S Perumalla, Alfred J Park

doi:10.1007/s10586-013-0277-4

Abstract

Reverse computation is presented here as an important future direction in addressing the challenge of fault-tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer from scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing, using measurements such as cache miss ratios, TLB misses and memory usage, indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
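
To make the contrast in the abstract concrete, the following is a minimal sketch of rollback by reverse computation next to rollback by checkpointing. It is not the authors' implementation: the `State` class, the toy integer update rule, and the function names are all hypothetical and chosen only for illustration. Integer arithmetic is assumed so that every forward operation has an exact inverse.

```python
import copy
from dataclasses import dataclass


@dataclass
class State:
    # Hypothetical toy state; integers keep the arithmetic exactly invertible.
    x: int  # position
    v: int  # velocity
    a: int  # constant acceleration


def forward_step(s: State) -> None:
    """Advance the state by one step."""
    s.x += s.v
    s.v += s.a


def reverse_step(s: State) -> None:
    """Undo one forward step: inverse operations, applied in reverse order."""
    s.v -= s.a
    s.x -= s.v


def run_with_checkpoint(s: State, steps: int) -> State:
    """Checkpointing: save a full copy of the state, then run forward."""
    saved = copy.deepcopy(s)
    for _ in range(steps):
        forward_step(s)
    return saved


def rollback_by_reversal(s: State, steps: int) -> None:
    """Reverse computation: recover the old state without any saved copy."""
    for _ in range(steps):
        reverse_step(s)


if __name__ == "__main__":
    initial = State(x=3, v=2, a=1)
    s = copy.deepcopy(initial)
    saved = run_with_checkpoint(s, steps=5)
    rollback_by_reversal(s, steps=5)
    # Both rollback routes arrive at the same pre-fault state.
    assert s == saved == initial
```

In this sketch the checkpoint costs a full copy of the state, whereas reversal needs only the number of steps to undo. That difference in memory traffic is the kind of effect the abstract's comparison of memory usage and cache behavior refers to, although a real simulation would also have to handle operations that are not exactly invertible.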

Journal: Cluster Computing	Publication Date: Jun 8, 2013
Citations: 36

Similar Papers

Fault tolerant matrix operations using checksum and reverse computation
Youngbae Kim ... J.S Plank
27 Mar 1996

Fault-tolerant disk storage and file systems using reflective memory
N Vekiarides
04 Jan 1995

Analysis and optimization of storage IO in distributed and massive parallel high performance systems
...
01 Jan 2010

ROSS: A high-performance, low-memory, modular Time Warp system
Christopher D Carothers ... Shawn Pearce
Journal of Parallel and Distributed Computing | VOL. 62
01 Nov 2002
