Abstract
Distributed computing resources available for high-energy physics research are becoming less dedicated to one type of workflow, and researchers' workloads are increasingly exploiting modern computing technologies such as parallelism. The current pilot job management model used by many experiments relies on static dedicated resources and cannot easily adapt to these changes. The model used for ATLAS in the Nordic countries and elsewhere enables a flexible job management system based on dynamic resource allocation. Rather than a fixed set of resources managed centrally, the model allows resources to be requested on the fly. The ARC Computing Element (ARC-CE) and the ARC Control Tower (aCT) are the key components of the model. The aCT requests jobs from the ATLAS job management system (PanDA) and submits fully-formed job descriptions to ARC-CEs, which can then dynamically request the required resources from the underlying batch system. In this paper we describe the architecture of the model and the experience of running many millions of ATLAS jobs on it.
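The push-based flow in the abstract can be pictured with a minimal sketch. This is not the actual aCT or PanDA code: the helper names (fetch_panda_job, to_xrsl, submit_to_arc_ce) and the job fields are hypothetical, and the xRSL fragment is a heavily simplified version of ARC's job description language.

```python
def fetch_panda_job():
    """Stand-in for pulling a queued job description from the PanDA server."""
    # The real aCT talks to PanDA over HTTPS; here we fabricate one job.
    return {"jobid": 42, "executable": "runAthena.sh",
            "cores": 8, "walltime_min": 720}


def to_xrsl(job):
    """Render a simplified xRSL-style job description for ARC-CE.

    Because the full resource requirements are known up front, ARC-CE can
    ask the batch system for exactly these resources -- no pilot is needed.
    """
    return (
        f'&(executable="{job["executable"]}")'
        f'(count={job["cores"]})'
        f'(walltime="{job["walltime_min"]} minutes")'
    )


def submit_to_arc_ce(xrsl):
    """Stand-in for an ARC client submission call."""
    print("submitting to ARC-CE:", xrsl)


if __name__ == "__main__":
    job = fetch_panda_job()            # aCT pulls the job from PanDA...
    submit_to_arc_ce(to_xrsl(job))     # ...and pushes a fully-formed description
```

The key design point is that the job's resource requirements travel with the submission, so the site batch scheduler sees the real payload rather than an opaque pilot.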
Highlights
Distributed computing resources available for high-energy physics research are becoming less dedicated to one type of workflow, and researchers' workloads are increasingly exploiting modern computing technologies such as parallelism
In this paper we describe the architecture of the model and the experience of running many millions of ATLAS jobs on it
The classic batch system approach, with job resource requirements known at the time of submission, has been successful elsewhere and continues to be successful in the high-performance computing (HPC) world today; the pilot mode in the grid world has made many issues related to infrastructure or service instabilities irrelevant by design
Summary
The job submission and execution instabilities experienced within the grid environment ten years ago led to the rejection of direct payload submission in favor of pilot-mode submission. Universal grid jobs called pilots are submitted to the computing elements and subsequently to the underlying batch systems. When they start execution on the worker nodes, they contact the central scheduling system to receive the job description; in other words, they pull the jobs from the virtual organization's scheduler. The central scheduler manages the job execution order through priorities and fair-share of virtual organizations or user groups. Relying on such central scheduling alone was never considered an option, due to the diversity and complexity of the computing sites, nor would it be suitable given administrative or political restrictions. The central scheduling and the site scheduling systems therefore need to adapt to each other.
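For contrast with the push model sketched after the abstract, the pilot ("pull") model described here can be reduced to the following sketch. All names are illustrative placeholders; a real pilot (e.g. the PanDA pilot) is far more elaborate, and pull_payload_from_scheduler is a hypothetical stand-in for the actual scheduler protocol.

```python
import subprocess


def pull_payload_from_scheduler():
    """Stand-in for the pilot contacting the central VO scheduler.

    The scheduler picks the next payload according to priorities and
    fair-share; the pilot itself carries no job-specific information
    when it is submitted to the site.
    """
    return {"cmd": ["echo", "running pulled analysis payload"]}


def run_pilot():
    payload = pull_payload_from_scheduler()
    if payload is None:
        return  # nothing matched; the pilot exits quietly
    subprocess.run(payload["cmd"], check=True)  # execute the pulled payload


if __name__ == "__main__":
    run_pilot()
```

The sketch makes the trade-off visible: the pilot masks site instabilities by occupying a batch slot before the payload is chosen, but the site scheduler never learns the payload's actual resource requirements, which is precisely the limitation the aCT/ARC-CE model addresses.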