Abstract

Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and conditions data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services. In some cases the initialization step can take as long as 10-15 minutes. Such slow initialization has a significant negative impact on the overall CPU efficiency of a production job, especially when the job is executed on opportunistic, often short-lived, resources such as commercial clouds or volunteer computing. To improve this situation, we can take advantage of the fact that ATLAS runs large numbers of production jobs with similar configuration parameters (e.g. jobs within the same production task). This allows us to checkpoint one job at the end of its configuration step and then use the generated checkpoint image for the rapid startup of thousands of production jobs. By applying this technique we can bring the initialization time of a job from tens of minutes down to just a few seconds. In addition, we can leverage container technology for restarting checkpointed applications on a variety of computing platforms, in particular platforms different from the one on which the checkpoint image was created. We describe the mechanism of creating checkpoint images of Geant4 simulation jobs with AthenaMP (the multi-process version of Athena, the ATLAS data simulation, reconstruction and analysis framework) and the use of these images for running ATLAS simulation production jobs on volunteer computing resources (ATLAS@Home) and on supercomputers.
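The pay-initialization-once idea behind the checkpoint approach can be illustrated with a minimal Python sketch (hypothetical, not ATLAS code): an expensive one-time setup runs in a parent process, and worker processes are started from the already-initialized parent so they inherit the ready state instead of each repeating the setup. This mirrors the AthenaMP pattern of forking workers after initialization; the actual cross-job checkpoint images described here go further by serializing the initialized process to disk.

```python
import time
from multiprocessing import Process

def expensive_init():
    # Stand-in for the real initialization: loading shared libraries,
    # detector geometry and conditions data, configuring services.
    time.sleep(0.5)
    return {"geometry": "loaded", "conditions": "loaded"}

def worker(state, worker_id):
    # Each worker receives the already-initialized state, so it can
    # start processing events immediately instead of re-initializing.
    assert state["geometry"] == "loaded"
    print(f"worker {worker_id}: processing with {state['conditions']} conditions")

if __name__ == "__main__":
    t0 = time.time()
    state = expensive_init()  # the expensive step is paid exactly once
    procs = [Process(target=worker, args=(state, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(f"total: {time.time() - t0:.1f}s (init paid once, not 4 times)")
```

With a checkpoint image the same saving is extended across whole production jobs and, via containers, across machines: thousands of jobs restart from one image rather than one parent process forking a few workers.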

Highlights

  • The ATLAS Experiment [1] processes its data at over 140 computing centers around the world using more than 4 million CPU-hours/day

  • Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and conditions data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services

  • In this paper we describe preliminary results obtained on two platforms: Volunteer Computing (ATLAS@Home [7]) and the Intel Knights Landing (KNL) supercomputer Cori at the National Energy Research Scientific Computing Center (NERSC), Berkeley, USA


Introduction

The ATLAS Experiment [1] processes its data at over 140 computing centers around the world using more than 4 million CPU-hours/day. Even at such a massive scale, the data processing of the experiment is resource-limited. In view of the steady demand for new computing resources, it becomes very important for the experiment to use its computing resources efficiently.

Published under licence by IOP Publishing Ltd
