Evaluation of process level redundant checkpointing/restart for HPC systems

Ifeanyi P Egwutuoha, Bran Selic, David Levy

doi:10.1109/pccc.2011.6108098

Publication Date: Nov 1, 2011

Affiliation: University of Sydney

#High Performance Computing Systems #High Performance Computing #Fault Tolerance Mechanisms #Clusters Of Personal Computers #Advantage Of Cost #Performance Benefits #Performance Computing Systems #Clusters Of Computers #Cost Benefits #Reliability Of System

Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive, massively parallel custom architectures to clusters of commodity personal computers to take advantage of cost and performance benefits. To avoid having to restart an application from the beginning after a sudden failure, checkpointing/restart fault tolerance mechanisms are commonly implemented. One drawback of checkpointing/restart is that it adds overhead, which increases the execution time of an application. We present a theoretical analysis of our technique, process level redundant (PLR) checkpointing/restart. The results show that PLR checkpointing/restart can significantly improve the overall reliability of an HPC system.
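
The two effects named in the abstract, checkpoint overhead and the reliability gain from redundancy, can be illustrated with a back-of-the-envelope sketch. This is not the model analysed in the paper: the function names, the fixed checkpoint interval, and the assumption that replicas fail independently are all hypothetical choices made here for illustration, and the reliability formula is the textbook parallel-redundancy expression.

```python
def checkpointed_runtime(work_s, interval_s, checkpoint_cost_s):
    """Failure-free runtime when a checkpoint is written after each full
    interval of work (hypothetical model, not taken from the paper)."""
    n_checkpoints = int(work_s // interval_s)
    # If the work ends exactly on an interval boundary, the last
    # checkpoint is unnecessary because the application is finished.
    if n_checkpoints > 0 and work_s % interval_s == 0:
        n_checkpoints -= 1
    return work_s + n_checkpoints * checkpoint_cost_s


def replicated_reliability(process_reliability, replicas):
    """Probability that at least one of `replicas` copies of a process
    survives, assuming independent failures (an assumption made here)."""
    return 1.0 - (1.0 - process_reliability) ** replicas


if __name__ == "__main__":
    # One hour of work, a checkpoint every 10 minutes, 30 s per checkpoint.
    print(checkpointed_runtime(3600, 600, 30))
    # A process that survives with probability 0.9, run as two replicas.
    print(replicated_reliability(0.9, 2))
```

Under these assumptions, the first function shows how each checkpoint adds directly to the execution time, and the second shows why running redundant copies of a process raises the chance that the computation survives a failure.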
