Abstract

PanDA (Production and Distributed Analysis) is the workload management system for ATLAS across the Worldwide LHC Computing Grid. While analysis tasks are submitted to PanDA by over a thousand users following personal schedules (e.g. PhD or conference deadlines), production campaigns are scheduled by a central Physics Coordination group based on the organization’s calendar. The Physics Coordination group needs to allocate the amount of Grid resources dedicated to each activity, in order to manage the sharing of CPU resources among various parallel campaigns and to make sure that results can be achieved in time for important deadlines. While dynamic and static shares on batch systems have been around for a long time, we are trying to move away from local resource partitioning and manage shares at a global level in the PanDA system. The global solution is not straightforward, given the different requirements of the activities (number of cores, memory, I/O and CPU intensity), the heterogeneity of Grid resources (site/hardware capabilities, batch configuration and queue setup) and constraints on data locality. We have therefore started the Global Shares project, which follows a requirements-driven, multi-step execution plan: defining nestable shares, implementing share-aware job dispatch, aligning internal processes with global shares and, finally, implementing pilot stream control to regulate batch slots while preserving late binding. This contribution explains the development work and architectural changes in PanDA to implement Global Shares, and describes how the project has enabled central control of resources and significantly reduced manual operations.
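
As a rough illustration of the "nestable shares" mentioned above, the sketch below (plain Python; the tree structure, activity names and percentages are invented for the example and are not the actual ATLAS share configuration or PanDA code) shows how a nested share definition can be resolved into absolute fractions of the total resources, so that each leaf activity ends up with a single target fraction.

```python
# Illustrative sketch only: the tree, activity names and percentages are
# invented examples, not the real ATLAS share configuration.

SHARE_TREE = {
    "Production": {
        "share": 80,
        "children": {
            "MC Simulation": {"share": 60},
            "MC Reconstruction": {"share": 25},
            "Derivations": {"share": 15},
        },
    },
    "Analysis": {"share": 20},
}


def resolve_leaf_fractions(tree, parent_fraction=1.0):
    """Return {leaf activity: absolute fraction of the total resources}.

    Sibling shares are renormalized by their sum, so each level does not
    have to add up to exactly 100.
    """
    fractions = {}
    level_total = sum(node["share"] for node in tree.values())
    for name, node in tree.items():
        fraction = parent_fraction * node["share"] / level_total
        children = node.get("children")
        if children:
            fractions.update(resolve_leaf_fractions(children, fraction))
        else:
            fractions[name] = fraction
    return fractions


if __name__ == "__main__":
    for activity, fraction in resolve_leaf_fractions(SHARE_TREE).items():
        print(f"{activity:20s} {fraction:.1%}")
    # With the numbers above: MC Simulation 48.0%, MC Reconstruction 20.0%,
    # Derivations 12.0%, Analysis 20.0% of the total.
```

In this sketch, renormalizing at every level is what makes the shares nestable: the sub-shares of one branch can be rebalanced without touching the rest of the tree.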

Highlights

  • With the exception of a few dedicated resources for real-time processing (Trigger and Tier0), all other workflows are scheduled across the Worldwide LHC Computing Grid (WLCG)[1] and other opportunistically used resources

  • The Global Shares project implements central control of resource allocation in PanDA[2], the Workload Management System used by ATLAS

  • Global Shares establish the amount of resources available instantaneously to a certain activity as a fraction of the total amount of resources available to ATLAS (a small numerical sketch follows this list)
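
To make this definition concrete, here is a small numerical sketch (plain Python; the total core count, fractions, usage figures and activity names are invented for illustration and this is not PanDA's actual dispatch code). It turns per-activity fractions into instantaneous core entitlements and ranks activities by how far below their share they currently sit, which is the kind of ordering a share-aware dispatcher needs.

```python
# Illustrative sketch only: total core count, fractions, usage and activity
# names are invented, and this is not the real PanDA dispatch logic.

TOTAL_CORES = 400_000          # resources available to ATLAS at this instant

TARGET_FRACTIONS = {           # global share of each activity (sums to 1.0)
    "MC Simulation": 0.48,
    "MC Reconstruction": 0.20,
    "Derivations": 0.12,
    "Analysis": 0.20,
}

CURRENT_USAGE = {              # cores each activity occupies right now
    "MC Simulation": 230_000,
    "MC Reconstruction": 60_000,
    "Derivations": 30_000,
    "Analysis": 60_000,
}


def dispatch_order(total_cores, targets, usage):
    """Rank activities by how many cores they are below their entitlement."""
    deficits = {
        activity: fraction * total_cores - usage.get(activity, 0)
        for activity, fraction in targets.items()
    }
    # Largest deficit first; activities above their share come out negative.
    return sorted(deficits.items(), key=lambda item: item[1], reverse=True)


for activity, deficit in dispatch_order(TOTAL_CORES, TARGET_FRACTIONS, CURRENT_USAGE):
    print(f"{activity:20s} deficit: {deficit:+,.0f} cores")
```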


Summary

Motivation

ATLAS requires a vast computing infrastructure and a wide variety of workflows to study collisions and write physics papers (see Figure 1). These workflows have different requirements: for example, Monte Carlo (MC) Generation and Simulation are CPU intensive, whilst Derivation jobs have higher I/O intensity. With the exception of a few dedicated resources for real-time processing (Trigger and Tier0), all other workflows are scheduled across the Worldwide LHC Computing Grid (WLCG)[1] and other opportunistically used resources. Traditional Grid resources are dedicated compute centers in universities or other laboratories. They are ranked into Tier-1, Tier-2 and Tier-3 in an attempt to describe their capabilities, service levels and usage. Sharing this common pool among parallel campaigns requires central control of how much of it each activity may use; the Global Shares project implements such control centrally in PanDA[2], the Workload Management System used by ATLAS.

Implementation
Tagging of tasks and jobs
PanDA job generation chain
Unified PanDA queues
Findings
Conclusions and future work
