Improving batch schedulers with node stealing for failed jobs

Yishu Du, Loris Marchal, Guillaume Pallez, Yves Robert

doi:10.1002/cpe.8043

Abstract

After a machine failure, batch schedulers typically reschedule the job that failed with a high priority. This is fair for the failed job, but it still requires that job to re-enter the submission queue and wait for enough resources to become available. The waiting time can be very long when the job is large and the platform is highly loaded, as is the case with typical HPC platforms. We propose another strategy: when a job fails and no platform node is available, we steal one node from another job and use it to continue the execution of the failed job despite the failure. In this work, we give a detailed assessment of this node stealing strategy using traces from the Mira supercomputer at Argonne National Laboratory. The main conclusion is that node stealing improves the utilization of the platform and dramatically reduces the flow of large jobs, at the price of slightly increasing the flow of small jobs.
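The node stealing decision described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' scheduler: the `Job` class, the `handle_node_failure` function, the choice of the smallest running job as the victim, and the requeueing of the victim are all assumptions made for the example, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare jobs by identity, not by field values
class Job:
    name: str
    nodes: set = field(default_factory=set)  # ids of nodes held by the job


def handle_node_failure(failed_job, failed_node, free_nodes, running_jobs, queue):
    """Try to keep failed_job running after it loses failed_node.

    Returns the job that was stolen from, or None if no stealing happened.
    """
    failed_job.nodes.discard(failed_node)

    # Case 1: a platform node is available, so no stealing is needed.
    if free_nodes:
        failed_job.nodes.add(free_nodes.pop())
        return None

    # Case 2: no free node, so look for another running job to steal from.
    candidates = [j for j in running_jobs if j is not failed_job and j.nodes]
    if not candidates:
        # Nothing to steal: fall back to classical resubmission with priority.
        running_jobs.remove(failed_job)
        free_nodes.update(failed_job.nodes)
        failed_job.nodes.clear()
        queue.insert(0, failed_job)
        return None

    # Assumption: the victim is the running job holding the fewest nodes.
    victim = min(candidates, key=lambda j: len(j.nodes))
    failed_job.nodes.add(victim.nodes.pop())

    # Assumption: the victim is interrupted, releases its remaining nodes,
    # and goes back to the submission queue.
    running_jobs.remove(victim)
    free_nodes.update(victim.nodes)
    victim.nodes.clear()
    queue.append(victim)
    return victim
```

Under these assumptions, a large job that loses a node on a fully loaded platform keeps running on a node taken from a small job, and the small job waits in the queue instead, which mirrors the trade-off between large and small jobs stated in the abstract.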

Journal: Concurrency and Computation: Practice and Experience
Publication Date: Feb 16, 2024
License type: cc-by
