Managing the ATLAS Grid through Harvester

Fernando Harald Barreiro Megino, Ivan Glushkov, Fahui Lin, Frank Berghaus, Kaushik De, Nicolò Magini, Aleksandr Alekseev, David Cameron, Tadashi Maeno, Andrej Filipcic, C Doglioni, G.A Stewart, L Silvestris, P Jackson, W Kamleh, D Kim

doi:10.1051/epjconf/202024503010

Abstract

ATLAS Computing Management has identified the migration of all computing resources to Harvester, PanDA’s new workload submission engine, as a critical milestone for LHC Runs 3 and 4. This contribution focuses on the Grid migration to Harvester. We have built a redundant architecture based on CERN IT’s common offerings (e.g. OpenStack Virtual Machines and Database on Demand) to run the necessary Harvester and HTCondor services, capable of sustaining the load of O(1M) workers on the Grid per day. We have reviewed the ATLAS Grid region by region and moved as much as possible away from blind worker submission, where multiple queues (e.g. single core, multi core, high memory) compete for resources on a site. Instead, we have migrated towards more intelligent models that use information and priorities from the central PanDA workload management system and stream the right number of workers of each category to a unified queue while keeping late binding to the jobs. We also describe our enhanced monitoring and analytics framework. Worker and job information is synchronized with minimal delay to a CERN IT-provided Elasticsearch repository, where we can interact with dashboards to follow submission progress, discover site issues (e.g. broken Compute Elements) or spot empty workers. The result is a much more efficient usage of the Grid resources with smart, built-in monitoring of resources.

Highlights

The Worldwide Large Hadron Collider (LHC) Computing Grid (WLCG) [1] is a highly heterogeneous federation of computing sites with different middleware and increasingly special resources, such as Cloud or High Performance Computing (HPC) resources.
PanDA [2] is the Workload Management System (WMS) for the ATLAS experiment [3] at the LHC, managing all production and user jobs across the WLCG centers associated with the experiment.
The Harvester project was born as an attempt to provide a universal Pilot submission system.

Summary

Introduction

The Worldwide LHC Computing Grid (WLCG) [1] is a highly heterogeneous federation of computing sites with different middleware and increasingly special resources, such as Cloud or High Performance Computing (HPC) resources. To exploit these resources, PanDA is based on the Pilot paradigm [4], where Pilot jobs are submitted to the batch systems at sites. Over the years, numerous Pilot submission systems have been developed, frequently specializing in a certain subset of resources and having independent code bases. In the case of Grid resources, some improvements were needed to increase stability and usage efficiency through a tighter integration with the PanDA Workload Management System, allowing more informed decision making. This contribution focuses on core Harvester design decisions and other significant aspects such as new submission modes and monitoring. We show the process and results of migrating all Grid resources to Harvester.
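As a rough illustration of what a more informed submission decision could look like, the sketch below splits a worker submission budget for one unified queue across resource types in proportion to the jobs waiting centrally. It is not Harvester's actual algorithm: the `QueueState` class, the `workers_to_submit` function, the proportional rule and the resource type labels are all assumptions made for this example. Workers are counted per category only and are not tied to specific jobs, which is the sense in which late binding is preserved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueState:
    """Hypothetical snapshot for one resource type on a unified queue."""
    activated_jobs: int  # jobs waiting centrally for this resource type
    queued_workers: int  # workers already submitted but not yet running


def workers_to_submit(state: dict[str, QueueState], max_new_workers: int) -> dict[str, int]:
    """Split a submission budget across resource types by unmet demand.

    Illustrative rule only (an assumption, not taken from the paper):
    each type receives a share of the budget proportional to its
    waiting jobs that are not yet covered by queued workers.
    """
    # Unmet demand: waiting jobs minus workers already in the batch queue.
    demand = {
        rtype: max(s.activated_jobs - s.queued_workers, 0)
        for rtype, s in state.items()
    }
    total = sum(demand.values())
    if total == 0:
        # Nothing is waiting, so submitting would only create empty workers.
        return {rtype: 0 for rtype in state}
    # Never submit more workers than there is unmet demand.
    budget = min(max_new_workers, total)
    # Integer share, rounded down; a type never gets more than its own demand.
    return {rtype: demand[rtype] * budget // total for rtype in state}


if __name__ == "__main__":
    snapshot = {
        "SCORE": QueueState(activated_jobs=30, queued_workers=10),
        "MCORE": QueueState(activated_jobs=10, queued_workers=0),
        "SCORE_HIMEM": QueueState(activated_jobs=0, queued_workers=5),
    }
    print(workers_to_submit(snapshot, max_new_workers=15))
```

Under this rule a category with no waiting jobs receives no new workers, in contrast to blind submission, where each queue would keep submitting regardless of central demand.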

Lightweight vs High Performance execution modes

Fast integration of new resources

Queue unification

Submission modes supported by Harvester

Worker monitoring

Service monitoring

Site monitoring

Central infrastructure

Migration process

Results


Journal: EPJ Web of Conferences	Publication Date: Jan 1, 2020
Citations: 3	License type: CC BY 4.0
