Abstract
ATLAS Distributed Computing during LHC Run-1 was challenged by steadily increasing computing, storage and network requirements. In addition, the complexity of processing task workflows and their associated data management requirements led to a new paradigm in the ATLAS computing model for Run-2, accompanied by extensive evolution and redesign of the workflow and data management systems. The new systems were put into production at the end of 2014 and gained robustness and maturity during 2015 data taking. ProdSys2, the new request and task interface; JEDI, the dynamic job execution engine developed as an extension to PanDA; and Rucio, the new data management system, form the core of the Run-2 ATLAS distributed computing engine.
One of the big changes for Run-2 was the adoption of the Derivation Framework, which moves the chaotic, CPU- and data-intensive part of user analysis into centrally organized train production, delivering derived AOD datasets to user groups for final analysis. The effectiveness of the new model was demonstrated by delivering analysis datasets to users just one week after data taking, with the calibration loop, Tier-0 processing and train production steps completed promptly. The great flexibility of the new system also makes it possible to execute part of the Tier-0 processing on the grid when Tier-0 resources experience a backlog during high data-taking periods.
The introduction of the data lifetime model, where each dataset is assigned a finite lifetime (with extensions possible for frequently accessed data), was made possible by Rucio. Thanks to this, the storage crises experienced in Run-1 have not reappeared during Run-2. In addition, the distinction between Tier-1 and Tier-2 disk storage, now largely artificial given the quality of Tier-2 resources and their networking, has been removed through the introduction of dynamic ATLAS clouds that group a storage-endpoint nucleus with its close-by execution satellite sites. All stable ATLAS sites are now able to store unique or primary copies of datasets.
ATLAS Distributed Computing is evolving further to speed up request processing by introducing network awareness, using machine learning, and optimising latencies during the execution of the full chain of tasks. The Event Service, a new workflow and job execution engine, is designed around check-pointing at the level of event processing to use opportunistic resources more efficiently.
ATLAS has been extensively exploring the possibilities of using computing resources beyond the conventional grid sites of the WLCG fabric, to deliver as many computing cycles as possible and thereby enhance the significance of the Monte Carlo samples and deliver better physics results. The exploitation of opportunistic resources was at an early stage throughout 2015, at the level of 10% of the total ATLAS computing power, but it is expected to deliver much more in the next few years. In addition, demonstrating the ability to use an opportunistic resource can lead to securing ATLAS allocations on the facility, so the importance of this work goes beyond the initial CPU cycles gained.
In this paper, we give an overview and compare the performance, development effort, flexibility and robustness of the various approaches.
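To make the data lifetime model concrete, the sketch below shows one way a finite dataset lifetime with extension-on-access could be expressed. The class, field names and retention periods are assumptions for illustration only and do not represent the actual Rucio implementation.

```python
from datetime import datetime, timedelta

# Hypothetical lifetime policy sketch; the defaults below are invented, not Rucio's.
DEFAULT_LIFETIME = timedelta(days=365)   # assumed default retention
ACCESS_EXTENSION = timedelta(days=180)   # assumed extension granted on access

class Dataset:
    def __init__(self, name, created=None):
        self.name = name
        created = created or datetime.utcnow()
        self.expires_at = created + DEFAULT_LIFETIME

    def record_access(self):
        """Frequently accessed data keeps being pushed beyond its current expiry."""
        candidate = datetime.utcnow() + ACCESS_EXTENSION
        self.expires_at = max(self.expires_at, candidate)

    def is_expired(self, now=None):
        return (now or datetime.utcnow()) > self.expires_at

def eligible_for_deletion(datasets, now=None):
    """Datasets past their lifetime become candidates for automatic cleanup."""
    return [d for d in datasets if d.is_expired(now)]
```

In this picture, storage is reclaimed automatically from expired datasets instead of through the manual clean-up campaigns that caused the Run-1 storage crises.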
Highlights
N.b.: this talk focuses on an ATLAS Distributed Computing overview; for more details, see the 40+ other contributions from the ADC community.
○ MONARC model is gone
○ Every stable site can store primary data
○ Every site well connected to the nucleus can process data
○ All associations are fully dynamic at the task and job brokering level
○ See 151
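A minimal sketch of the nucleus/satellite idea behind these dynamic clouds is given below. The site names, connectivity values and thresholds are invented for illustration; the real brokering lives in PanDA/JEDI and uses many more inputs.

```python
# Illustrative-only site catalogue and network quality (e.g. from transfer monitoring).
SITES = {
    "SITE_A": {"stable": True,  "free_disk_tb": 800},
    "SITE_B": {"stable": True,  "free_disk_tb": 120},
    "SITE_C": {"stable": False, "free_disk_tb": 500},
}
CONNECTIVITY = {
    ("SITE_A", "SITE_B"): 0.9,
    ("SITE_A", "SITE_C"): 0.4,
}

def choose_nucleus(output_size_tb):
    """Any stable site with enough free disk can host the primary output copy."""
    candidates = [s for s, info in SITES.items()
                  if info["stable"] and info["free_disk_tb"] >= output_size_tb]
    return max(candidates, key=lambda s: SITES[s]["free_disk_tb"]) if candidates else None

def choose_satellites(nucleus, min_link=0.7):
    """Sites well connected to the nucleus are allowed to process its data."""
    def link(a, b):
        return CONNECTIVITY.get((a, b)) or CONNECTIVITY.get((b, a)) or 0.0
    return [s for s in SITES if s != nucleus and link(nucleus, s) >= min_link]

nucleus = choose_nucleus(output_size_tb=50)   # -> "SITE_A"
satellites = choose_satellites(nucleus)       # -> ["SITE_B"]
```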
○ Memory and walltime of jobs are measured for the first 10 jobs of a task and set for the rest
○ Retries of failed jobs get increased memory or walltime if that was the reason for the failure
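A hedged sketch of this scout-job mechanism follows: the first few jobs of a task are measured, their usage sizes the remaining jobs, and failed jobs are retried with inflated limits. The function names, safety factor and retry inflation are illustrative assumptions, not the actual JEDI parameters.

```python
SCOUT_COUNT = 10          # number of jobs measured before fixing task limits
SAFETY_FACTOR = 1.2       # assumed headroom on top of measured usage
RETRY_INFLATION = 1.5     # assumed increase applied on memory/walltime failures

def task_limits(scout_results):
    """Derive per-job limits for the rest of the task from its scout jobs."""
    mem = max(r["max_rss_mb"] for r in scout_results) * SAFETY_FACTOR
    wall = max(r["walltime_s"] for r in scout_results) * SAFETY_FACTOR
    return {"memory_mb": mem, "walltime_s": wall}

def limits_for_retry(job_limits, failure_reason):
    """Retries get more memory or walltime only if that caused the failure."""
    new_limits = dict(job_limits)
    if failure_reason == "out_of_memory":
        new_limits["memory_mb"] *= RETRY_INFLATION
    elif failure_reason == "walltime_exceeded":
        new_limits["walltime_s"] *= RETRY_INFLATION
    return new_limits

scouts = [{"max_rss_mb": 1800, "walltime_s": 5400} for _ in range(SCOUT_COUNT)]
limits = task_limits(scouts)                          # sizes the remaining jobs
retry = limits_for_retry(limits, "out_of_memory")     # memory bumped by 50%
```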
Summary
○ Duty cycle ~80%
○ Higher luminosity in 2016
○ 50% more data delivered than expected
● Allocated computing resources are not sufficient to cope with the data-taking load, but sites provide more CPU power than requested
● New production framework developed for Run-2
ATLAS computing works extremely well and provides the results on time for conferences.