Mira Supercomputer Research Articles

This paper focuses on the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the classical problem while assuming that jobs are subject to failures caused by transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in that framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when restricted to shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. The paper focuses on the design of several heuristics, some list-based and some shelf-based, along with different priority rules and backfilling strategies. We assess and compare their performance through an extensive set of simulations using both synthetic jobs and log traces from the Mira supercomputer.
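To make the idea of list scheduling with re-execution concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Job` fields, the `list_schedule` function, and in particular the `failures` count (a fixed number of failed attempts known in advance, each occupying the job's full execution time) are assumptions chosen for illustration. In the setting described above, failures are not known beforehand.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int         # processors required (rigid: fixed for the job)
    time: float        # duration of one execution attempt
    failures: int = 0  # hypothetical: failed attempts before the job succeeds


def list_schedule(jobs, total_procs):
    """Greedy list scheduling, longest job first, with re-execution.

    Returns (makespan, starts), where starts lists (start_time, job_name)
    for every attempt, including failed ones.
    """
    if any(j.procs > total_procs for j in jobs):
        raise ValueError("a job needs more processors than the platform has")

    pending = sorted(jobs, key=lambda j: -j.time)  # priority list
    remaining = {j.name: j.failures for j in jobs}
    running = []  # min-heap of (finish_time, tie_breaker, job)
    free, now, seq, starts = total_procs, 0.0, 0, []

    while pending or running:
        # Scan the list in priority order and start every job that fits.
        waiting = []
        for job in pending:
            if job.procs <= free:
                free -= job.procs
                heapq.heappush(running, (now + job.time, seq, job))
                seq += 1
                starts.append((now, job.name))
            else:
                waiting.append(job)
        pending = waiting

        # Advance time to the next completion and release its processors.
        now, _, job = heapq.heappop(running)
        finished = [job]
        while running and running[0][0] == now:
            finished.append(heapq.heappop(running)[2])
        for job in finished:
            free += job.procs
            if remaining[job.name] > 0:
                # The attempt failed: put the job back in the list.
                remaining[job.name] -= 1
                pending.append(job)
        pending.sort(key=lambda j: -j.time)

    return now, starts


if __name__ == "__main__":
    jobs = [Job("A", 2, 3, failures=1), Job("B", 2, 2), Job("C", 4, 1)]
    print(list_schedule(jobs, total_procs=4))
```

In this toy instance on 4 processors, job A fails once and is re-executed, which delays the wide job C until A's second attempt has finished. Swapping the sort key is one way to try other priority rules; shelves and backfilling are not modeled in this sketch.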

The universe is permeated by magnetic fields, with strengths ranging from a femtogauss in the voids between the filaments of galaxy clusters to several teragauss in black holes and neutron stars. The standard model behind cosmological magnetic fields is the nonlinear amplification of seed fields via turbulent dynamo to the values observed. We have conceived experiments that aim to demonstrate and study the turbulent dynamo mechanism in the laboratory. Here, we describe the design of these experiments through simulation campaigns using FLASH, a highly capable radiation magnetohydrodynamics code that we have developed, and large-scale three-dimensional simulations on the Mira supercomputer at the Argonne National Laboratory. The simulation results indicate that the experimental platform may be capable of reaching a turbulent plasma state and determining the dynamo amplification. We validate and compare our numerical results with a small subset of experimental data using synthetic diagnostics.

Articles published on Mira Supercomputer

Improving batch schedulers with node stealing for failed jobs

Resilient Scheduling Heuristics for Rigid Parallel Jobs

HPC Opens a New Frontier in Fuel-Engine Research

Numerical modeling of laser-driven experiments aiming to demonstrate magnetic field amplification via turbulent dynamo

Argonne Discovery Yields Self-Healing Diamond-Like Carbon

Optimising the Termofluids CFD code for petascale simulations

Delivering Science on Day One

Asynchronous Two-level Checkpointing Scheme for Large-scale Adjoints in the Spectral-Element Solver Nek5000
