Abstract

ATLAS Computing Management has identified the migration of all computing resources to Harvester, PanDA’s new workload submission engine, as a critical milestone for LHC Runs 3 and 4. This contribution will focus on the Grid migration to Harvester. We have built a redundant architecture based on CERN IT’s common offerings (e.g. OpenStack Virtual Machines and Database on Demand) to run the necessary Harvester and HTCondor services, capable of sustaining the load of O(1M) workers on the Grid per day. We have reviewed the ATLAS Grid region by region and moved as much as possible away from blind worker submission, where multiple queues (e.g. single core, multi core, high memory) compete for resources on a site. Instead, we have migrated towards more intelligent models that use information and priorities from the central PanDA workload management system and stream the right number of workers of each category to a unified queue, while keeping late binding to the jobs. We will also describe our enhanced monitoring and analytics framework. Worker and job information is synchronized with minimal delay to an Elasticsearch repository provided by CERN IT, where we can interact with dashboards to follow submission progress, discover site issues (e.g. broken Compute Elements) or spot empty workers. The result is a much more efficient usage of the Grid resources, with smart, built-in monitoring of resources.
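To make the contrast with blind submission concrete, the following is a minimal sketch of how a submission cycle for a unified queue could size the worker stream per category from centrally known demand. It is not Harvester's actual algorithm or API: the `CategoryDemand` type, the `plan_workers` function, the category labels and the per-cycle cap are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CategoryDemand:
    """Hypothetical per-category view of a unified queue."""
    queued_jobs: int      # jobs waiting centrally for this category
    active_workers: int   # workers already submitted or running


def plan_workers(demand, max_new_workers):
    """Return how many new workers to submit per category in one cycle.

    Each category asks for its shortfall (queued jobs not yet covered by
    a worker). If the total exceeds the per-cycle cap, the cap is shared
    in proportion to each shortfall.
    """
    shortfall = {
        name: max(d.queued_jobs - d.active_workers, 0)
        for name, d in demand.items()
    }
    total = sum(shortfall.values())
    if total <= max_new_workers:
        return shortfall

    # Proportional share, rounded down.
    plan = {
        name: need * max_new_workers // total
        for name, need in shortfall.items()
    }
    # Hand out the workers lost to rounding, largest unmet need first.
    leftover = max_new_workers - sum(plan.values())
    by_unmet = sorted(
        shortfall, key=lambda name: shortfall[name] - plan[name], reverse=True
    )
    for name in by_unmet[:leftover]:
        plan[name] += 1
    return plan


if __name__ == "__main__":
    demand = {
        "single_core": CategoryDemand(queued_jobs=100, active_workers=20),
        "multi_core": CategoryDemand(queued_jobs=30, active_workers=10),
        "high_memory": CategoryDemand(queued_jobs=5, active_workers=10),
    }
    print(plan_workers(demand, max_new_workers=50))
```

In this sketch a category with no outstanding demand receives no workers, which is the behaviour that avoids empty workers; the jobs themselves are still matched to workers only once the workers start, so late binding is unaffected.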

Highlights

  • The Worldwide Large Hadron Collider (LHC) Computing Grid (WLCG) [1] is a highly heterogeneous federation of computing sites with different middleware and increasingly special resources, such as Cloud or High Performance Computing (HPC) resources

  • PanDA [2] is the Workload Management System (WMS) for the ATLAS experiment [3] at the LHC, managing all production and user jobs across the WLCG centers associated with the experiment

  • The Harvester project was born as an attempt to provide a universal Pilot submission system


Summary

Introduction

The Worldwide LHC Computing Grid (WLCG) [1] is a highly heterogeneous federation of computing sites with different middleware and increasingly special resources, such as Cloud or High Performance Computing (HPC) resources. To exploit these resources, PanDA is based on the Pilot paradigm [4], in which Pilot jobs are submitted to the batch systems at sites. Over the years, numerous Pilot submission systems have been developed, frequently specializing in a certain subset of resources and maintaining independent code bases. For Grid resources, improvements were needed to increase stability and usage efficiency through tighter integration with the PanDA Workload Management System, allowing more informed decision making. This contribution will focus on core Harvester design decisions and other significant aspects such as new submission modes and monitoring. We will show the process and results of migrating all Grid resources to Harvester.

Lightweight vs High Performance execution modes
Fast integration of new resources
Queue unification
Submission modes supported by Harvester
Worker monitoring
Service monitoring
Site monitoring
Central infrastructure
Migration process
Results
