Abstract

Experiments at the Large Hadron Collider (LHC) face unprecedented computing challenges. Heterogeneous resources are distributed worldwide at hundreds of sites, thousands of physicists analyse the data remotely, the volume of processed data is beyond the exabyte scale, and data processing requires several billion hours of computing time per year. The PanDA (Production and Distributed Analysis) system was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. In the process, the old batch-job paradigm of locally managed computing in HEP was discarded in favour of a far more automated, flexible and scalable model. The success of PanDA in ATLAS is leading to widespread adoption and testing by other experiments. PanDA is the first exascale workload management system in HEP, already operating at more than a million computing jobs per day, and it processed over an exabyte of data in 2013. Beyond the familiar challenges of scale, heterogeneity and a growing user base, PanDA faces many new challenges in the near future. It will need to handle rapidly changing computing infrastructure, require factorization of code for easier deployment, incorporate additional information sources including network metrics into decision making, control network circuits, handle dynamically sized workload processing, provide improved visualization, and address many other challenges. In this talk we focus on the new features, planned or recently implemented, that are relevant to the next decade of distributed computing workload management using PanDA.

Highlights

  • PanDA = Production and Distributed Analysis System
    – Designed to meet ATLAS production/analysis requirements for a data-driven workload management system capable of operating at LHC data processing scale
    – PanDA has performed well for ATLAS, including the LHC Run 1 data-taking period
      · Producing high-volume Monte Carlo samples and making huge computing resources available for individual analysis
      · Running ~150K jobs concurrently
      · Processing ~0.7 million (~1.5 million at peak) jobs per day
    – Being actively evolved to meet the rapidly changing requirements for analysis use cases
      · No significant service disruptions
    – New developments for Run 2 and beyond

  • New components and features have been delivered to ATLAS before LHC Run 2

  • Many developments and challenges to come while steadily running for LHC Run 2

Summary

The Future of PanDA in ATLAS Distributed Computing

University of Texas at Arlington, USA; Brookhaven National Laboratory, USA; Joint Institute for Nuclear Research, Russia; Argonne National Laboratory, USA

Introduction

  • Event-level partitioning to minimize losses due to early terminations
  • Usage of WAN data access for user jobs
  • Fine-grained partitioning of processing
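The idea behind event-level partitioning can be illustrated with a minimal sketch: a job's event range is split into small chunks, so that an early termination (e.g. preemption on an opportunistic resource) loses at most one unfinished chunk rather than the whole job's output. This is only an illustrative sketch under that assumption; the function name and chunking scheme are hypothetical and not taken from the PanDA code base.

```python
def partition_events(first_event, last_event, chunk_size):
    """Yield inclusive (start, end) event ranges of at most chunk_size events.

    Illustrative only: each chunk can be processed and its output saved
    independently, so an early termination loses at most one chunk.
    """
    start = first_event
    while start <= last_event:
        end = min(start + chunk_size - 1, last_event)
        yield (start, end)
        start = end + 1

# A 1000-event job split into chunks of up to 300 events:
chunks = list(partition_events(1, 1000, 300))
# chunks == [(1, 300), (301, 600), (601, 900), (901, 1000)]
```

With coarse, job-level granularity, losing the job near its end wastes all the work done so far; with fine-grained chunks, the completed ranges survive and only the in-flight chunk must be redone.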