Abstract
ATLAS Distributed Computing during LHC Run-1 was challenged by steadily increasing computing, storage and network requirements. In addition, the complexity of processing task workflows and their associated data management requirements led to a new paradigm in the ATLAS computing model for Run-2, accompanied by extensive evolution and redesign of the workflow and data management systems. The new systems were put into production at the end of 2014 and gained robustness and maturity during 2015 data taking. ProdSys2, the new request and task interface; JEDI, the dynamic job execution engine developed as an extension to PanDA; and Rucio, the new data management system, form the core of the Run-2 ATLAS distributed computing engine.
One of the big changes for Run-2 was the adoption of the Derivation Framework, which moves the chaotic, CPU- and data-intensive part of user analysis into centrally organized train production, delivering derived AOD datasets to user groups for final analysis. The effectiveness of the new model was demonstrated by delivering analysis datasets to users just one week after data taking, with the calibration loop, Tier-0 processing and train production steps completed promptly. The great flexibility of the new system also makes it possible to execute part of the Tier-0 processing on the grid when Tier-0 resources experience a backlog during high data-taking periods.
The introduction of the data lifetime model, where each dataset is assigned a finite lifetime (with extensions possible for frequently accessed data), was made possible by Rucio. Thanks to this, the storage crises experienced in Run-1 have not reappeared during Run-2. In addition, the distinction between Tier-1 and Tier-2 disk storage, now largely artificial given the quality of Tier-2 resources and their networking, has been removed through the introduction of dynamic ATLAS clouds that group a storage-endpoint nucleus with its close-by execution satellite sites. All stable ATLAS sites are now able to store unique or primary copies of datasets.
ATLAS Distributed Computing is evolving further to speed up request processing by introducing network awareness, using machine learning, and optimising latencies during the execution of the full chain of tasks. The Event Service, a new workflow and job execution engine, is designed around check-pointing at the level of event processing to use opportunistic resources more efficiently.
ATLAS has been extensively exploring the possibilities of using computing resources beyond the conventional grid sites of the WLCG fabric, to deliver as many computing cycles as possible and thereby enhance the significance of the Monte Carlo samples and deliver better physics results. The exploitation of opportunistic resources was at an early stage throughout 2015, at the level of 10% of the total ATLAS computing power, but it is expected to deliver much more in the next few years. In addition, demonstrating the ability to use an opportunistic resource can lead to securing ATLAS allocations on the facility, so the importance of this work goes beyond the initial CPU cycles gained.
In this paper, we give an overview and compare the performance, development effort, flexibility and robustness of the various approaches.
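To make the data lifetime model concrete, the sketch below shows one way a finite dataset lifetime with extension-on-access could be expressed. The class, field names and retention periods are assumptions for illustration only and do not represent the actual Rucio implementation.

```python
from datetime import datetime, timedelta

# Hypothetical lifetime policy sketch; the defaults below are invented, not Rucio's.
DEFAULT_LIFETIME = timedelta(days=365)   # assumed default retention
ACCESS_EXTENSION = timedelta(days=180)   # assumed extension granted on access

class Dataset:
    def __init__(self, name, created=None):
        self.name = name
        created = created or datetime.utcnow()
        self.expires_at = created + DEFAULT_LIFETIME

    def record_access(self):
        """Frequently accessed data keeps being pushed beyond its current expiry."""
        candidate = datetime.utcnow() + ACCESS_EXTENSION
        self.expires_at = max(self.expires_at, candidate)

    def is_expired(self, now=None):
        return (now or datetime.utcnow()) > self.expires_at

def eligible_for_deletion(datasets, now=None):
    """Datasets past their lifetime become candidates for automatic cleanup."""
    return [d for d in datasets if d.is_expired(now)]
```

In this picture, storage is reclaimed automatically from expired datasets instead of through the manual clean-up campaigns that caused the Run-1 storage crises.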
Highlights
N.b.: this talk focuses on an ATLAS Distributed Computing overview; for more details, see the 40+ other contributions from the ADC community.
○ MONARC model is gone
○ Every stable site can store primary data
○ Every site well connected to the nucleus can process data
○ All associations are fully dynamic at the task and job brokering level
○ See 151
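A minimal sketch of the nucleus/satellite idea behind these dynamic clouds is given below. The site names, connectivity values and thresholds are invented for illustration; the real brokering lives in PanDA/JEDI and uses many more inputs.

```python
# Illustrative-only site catalogue and network quality (e.g. from transfer monitoring).
SITES = {
    "SITE_A": {"stable": True,  "free_disk_tb": 800},
    "SITE_B": {"stable": True,  "free_disk_tb": 120},
    "SITE_C": {"stable": False, "free_disk_tb": 500},
}
CONNECTIVITY = {
    ("SITE_A", "SITE_B"): 0.9,
    ("SITE_A", "SITE_C"): 0.4,
}

def choose_nucleus(output_size_tb):
    """Any stable site with enough free disk can host the primary output copy."""
    candidates = [s for s, info in SITES.items()
                  if info["stable"] and info["free_disk_tb"] >= output_size_tb]
    return max(candidates, key=lambda s: SITES[s]["free_disk_tb"]) if candidates else None

def choose_satellites(nucleus, min_link=0.7):
    """Sites well connected to the nucleus are allowed to process its data."""
    def link(a, b):
        return CONNECTIVITY.get((a, b)) or CONNECTIVITY.get((b, a)) or 0.0
    return [s for s in SITES if s != nucleus and link(nucleus, s) >= min_link]

nucleus = choose_nucleus(output_size_tb=50)   # -> "SITE_A"
satellites = choose_satellites(nucleus)       # -> ["SITE_B"]
```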
○ Memory and walltime of jobs are measured for the first 10 jobs of a task and set for the rest
○ Retries of failed jobs get increased memory or walltime if that was the reason for the failure
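A hedged sketch of this scout-job mechanism follows: the first few jobs of a task are measured, their usage sizes the remaining jobs, and failed jobs are retried with inflated limits. The function names, safety factor and retry inflation are illustrative assumptions, not the actual JEDI parameters.

```python
SCOUT_COUNT = 10          # number of jobs measured before fixing task limits
SAFETY_FACTOR = 1.2       # assumed headroom on top of measured usage
RETRY_INFLATION = 1.5     # assumed increase applied on memory/walltime failures

def task_limits(scout_results):
    """Derive per-job limits for the rest of the task from its scout jobs."""
    mem = max(r["max_rss_mb"] for r in scout_results) * SAFETY_FACTOR
    wall = max(r["walltime_s"] for r in scout_results) * SAFETY_FACTOR
    return {"memory_mb": mem, "walltime_s": wall}

def limits_for_retry(job_limits, failure_reason):
    """Retries get more memory or walltime only if that caused the failure."""
    new_limits = dict(job_limits)
    if failure_reason == "out_of_memory":
        new_limits["memory_mb"] *= RETRY_INFLATION
    elif failure_reason == "walltime_exceeded":
        new_limits["walltime_s"] *= RETRY_INFLATION
    return new_limits

scouts = [{"max_rss_mb": 1800, "walltime_s": 5400} for _ in range(SCOUT_COUNT)]
limits = task_limits(scouts)                          # sizes the remaining jobs
retry = limits_for_retry(limits, "out_of_memory")     # memory bumped by 50%
```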
Summary
○ Duty cycle ~80%
○ Higher luminosity in 2016
○ 50% more data delivered than expected
● Allocated computing resources are not sufficient to cope with the data-taking load, but sites provide more CPU power than requested
● New production framework developed for Run-2
ATLAS computing works extremely well and provides the results on time for conferences.