Abstract

The Production and Distributed Analysis (PanDA) system has been developed to meet ATLAS production and analysis requirements for a data-driven workload management system capable of operating at the Large Hadron Collider (LHC) data processing scale. The heterogeneous resources used by the ATLAS experiment are distributed worldwide across hundreds of sites; thousands of physicists analyse the data remotely; the volume of processed data is beyond the exabyte scale; dozens of scientific applications are supported; and data processing requires several billion hours of computing per year. PanDA performed very well over the last decade, including the LHC Run 1 data taking period. Nevertheless, it was decided to upgrade the whole system during the LHC's first long shutdown in order to cope with a rapidly changing computing infrastructure. After two years of reengineering efforts, PanDA has embedded capabilities for fully dynamic and flexible workload management. The static batch job paradigm was discarded in favor of a more automated and scalable model. Workloads are dynamically tailored for optimal usage of resources, with the brokerage taking network traffic and forecasts into account. Computing resources are partitioned based on dynamic knowledge of their status and characteristics. The pilot has been re-factored around a plugin structure for easier development and deployment. Bookkeeping is handled with both coarse and fine granularities for efficient utilization of pledged and opportunistic resources. An in-house security mechanism authenticates the pilot and data management services in off-grid environments such as volunteer computing and private local clusters. The PanDA monitor has been extensively optimized for performance and extended with analytics to provide aggregated summaries of the system as well as drill-down to operational details.
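To make the brokerage idea concrete, the sketch below scores candidate sites by combining free job slots with a network-throughput forecast. All names, fields, and weights here are hypothetical illustrations of the principle, not PanDA's actual brokerage API or tuning.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_slots: int           # currently idle job slots (hypothetical metric)
    net_mbps_to_input: float  # forecast throughput to the input data (hypothetical metric)

def broker_score(site: Site, w_net: float = 0.5) -> float:
    """Blend slot availability and the network forecast into one score.
    The linear weighting is purely illustrative."""
    return (1 - w_net) * site.free_slots + w_net * site.net_mbps_to_input

def choose_site(sites: list[Site]) -> Site:
    """Pick the candidate site with the highest combined score."""
    return max(sites, key=broker_score)

sites = [Site("SITE_A", 200, 80.0), Site("SITE_B", 50, 900.0)]
best = choose_site(sites)  # SITE_B wins: its network forecast dominates
```

A network-aware broker of this shape will route work to a site with fewer free slots when the data can be delivered there much faster, which is the behavior the abstract describes.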
Many further developments are planned or have recently been implemented, and the system has been adopted by non-LHC experiments, such as bioinformatics groups successfully running Paleomix (microbial genome and metagenome analysis) payloads on supercomputers. In this paper we focus on the new and planned features that are most important to the next decade of distributed computing workload management.

Highlights

  • New components and features have been delivered to ATLAS

  • Many developments and challenges to come while steadily running for LHC Run 2

    – Pilot 2.0
    – Harvester
    – Network provisioning
    – Automation based on prediction capabilities
    – More optimal use of computing resources
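The plugin structure of the re-factored pilot can be pictured with a small registry sketch. The registry, plugin kinds, and class names below are hypothetical and only illustrate the pattern; Pilot 2.0's real interfaces differ.

```python
# Hypothetical plugin registry: the pilot is assembled from swappable
# components (e.g. data copy tools), selected by configuration.
PLUGINS = {}

def register(kind: str, name: str):
    """Decorator that records an implementation under (kind, name)."""
    def deco(cls):
        PLUGINS[(kind, name)] = cls
        return cls
    return deco

@register("copytool", "rucio")
class RucioCopy:
    def fetch(self, lfn: str) -> str:
        return f"downloaded {lfn} via rucio"

@register("copytool", "local")
class LocalCopy:
    def fetch(self, lfn: str) -> str:
        return f"copied {lfn} from shared filesystem"

def get_plugin(kind: str, name: str):
    """Instantiate the implementation a site configuration selected."""
    return PLUGINS[(kind, name)]()

tool = get_plugin("copytool", "local")  # chosen per site, no code changes
```

The design benefit named in the highlights follows directly: a new resource type only needs a new plugin class, not changes to the pilot core.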


Summary

Introduction

– Designed to meet ATLAS production/analysis requirements for a data-driven workload management system capable of operating at LHC data processing scale.
– Performed well for ATLAS over the last decade, including the LHC data taking period.
➢ The system has been greatly improved for LHC Run 2 but still has issues to be addressed:
  ➢ Inefficiency due to old resource partitioning based on geographical grouping of computing centers.
  ➢ Suboptimal usage of non-traditional resources due to job-based workload management.
➢ To leverage prediction capabilities for resource availability, actively being developed with recent computing technologies such as machine learning.
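The prediction capability in the last bullet can be as simple as smoothing recent observations of a site's free slots into a one-step forecast. The exponential moving average below is a toy stand-in; the real system would use richer machine-learning models, and the numbers are illustrative.

```python
def ema_forecast(history: list[float], alpha: float = 0.3) -> float:
    """One-step forecast of a site metric (e.g. free slots) using an
    exponentially weighted moving average. alpha is illustrative."""
    forecast = history[0]
    for x in history[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
    return forecast

free_slots = [100, 120, 90, 110, 105]  # hypothetical recent samples
predicted = ema_forecast(free_slots)   # ~104: a smoothed availability estimate
```

A broker fed with such forecasts can commit work to a resource before slots free up, instead of reacting only to the current snapshot.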

Resource Consolidation
Intelligent Brokerage
PanDA at HPC Centers
Event Service
Summary