Job migration in HPC clusters by means of checkpoint/restart

Manuel Rodríguez-Pascual, Rafael Mayo-García, José A. Moríñigo, Jiajun Cao, Gene Cooperman

doi:10.1007/s11227-019-02857-y

Open Access

Journal: The Journal of Supercomputing
Publication Date: Apr 23, 2019
Citations: 11

Affiliations: Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT); Northeastern University

Abstract

Until now, jobs running on HPC clusters were tied to the node where their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager, in a way that is fully transparent to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities, because jobs can be migrated, checkpointed, and restarted in a different place or at a different time, while fault tolerance is provided for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where growing scale and the complexity of efficient scheduling make it challenging to deliver the degree of parallelism that applications demand.

Full Text