Abstract
Distributed computing resources available for high-energy physics research are becoming less dedicated to one type of workflow, and researchers' workloads are increasingly exploiting modern computing technologies such as parallelism. The current pilot job management model used by many experiments relies on static dedicated resources and cannot easily adapt to these changes. The model used for ATLAS in the Nordic countries and elsewhere enables a flexible job management system based on dynamic resource allocation. Rather than a fixed set of resources managed centrally, the model allows resources to be requested on the fly. The ARC Computing Element (ARC-CE) and the ARC Control Tower (aCT) are the key components of the model. The aCT requests jobs from the ATLAS job management system (PanDA) and submits fully-formed job descriptions to ARC-CEs, which can then dynamically request the required resources from the underlying batch system. In this paper we describe the architecture of the model and the experience of running many millions of ATLAS jobs on it.
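The push-based flow in the abstract can be pictured with a minimal sketch. This is not the actual aCT or PanDA code: the helper names (fetch_panda_job, to_xrsl, submit_to_arc_ce) and the job fields are hypothetical, and the xRSL fragment is a heavily simplified version of ARC's job description language.

```python
def fetch_panda_job():
    """Stand-in for pulling a queued job description from the PanDA server."""
    # The real aCT talks to PanDA over HTTPS; here we fabricate one job.
    return {"jobid": 42, "executable": "runAthena.sh",
            "cores": 8, "walltime_min": 720}


def to_xrsl(job):
    """Render a simplified xRSL-style job description for ARC-CE.

    Because the full resource requirements are known up front, ARC-CE can
    ask the batch system for exactly these resources -- no pilot is needed.
    """
    return (
        f'&(executable="{job["executable"]}")'
        f'(count={job["cores"]})'
        f'(walltime="{job["walltime_min"]} minutes")'
    )


def submit_to_arc_ce(xrsl):
    """Stand-in for an ARC client submission call."""
    print("submitting to ARC-CE:", xrsl)


if __name__ == "__main__":
    job = fetch_panda_job()            # aCT pulls the job from PanDA...
    submit_to_arc_ce(to_xrsl(job))     # ...and pushes a fully-formed description
```

The key design point is that the job's resource requirements travel with the submission, so the site batch scheduler sees the real payload rather than an opaque pilot.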
Highlights
Distributed computing resources available for high-energy physics research are becoming less dedicated to one type of workflow, and researchers' workloads are increasingly exploiting modern computing technologies such as parallelism
In this paper we describe the architecture of the model and the experience of running many millions of ATLAS jobs on it
The classic batch system approach, with job resource requirements known at the time of submission, has been successful elsewhere and continues to be successful in the high-performance computing (HPC) world today; the pilot mode in the grid world has made many issues related to infrastructure or service instabilities irrelevant by design
Summary
The job submission and execution instabilities experienced within the grid environment ten years ago led to the rejection of direct payload submission in favor of pilot-mode submission. Universal grid jobs called pilots are submitted to the computing elements and subsequently to the underlying batch systems. When they start execution on the worker nodes, they contact the central scheduling system to receive the job description; in other words, they pull the jobs from the virtual organization's scheduler. The central scheduler manages the job execution order through priorities and fair-share of virtual organizations or user groups. Relying on such central scheduling alone was never considered an option, due to the diversity and complexity of the computing sites, nor would it be suitable given administrative or political restrictions. The central scheduling and the site scheduling systems therefore need to adapt to each other.
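For contrast with the push model sketched after the abstract, the pilot ("pull") model described here can be reduced to the following sketch. All names are illustrative placeholders; a real pilot (e.g. the PanDA pilot) is far more elaborate, and pull_payload_from_scheduler is a hypothetical stand-in for the actual scheduler protocol.

```python
import subprocess


def pull_payload_from_scheduler():
    """Stand-in for the pilot contacting the central VO scheduler.

    The scheduler picks the next payload according to priorities and
    fair-share; the pilot itself carries no job-specific information
    when it is submitted to the site.
    """
    return {"cmd": ["echo", "running pulled analysis payload"]}


def run_pilot():
    payload = pull_payload_from_scheduler()
    if payload is None:
        return  # nothing matched; the pilot exits quietly
    subprocess.run(payload["cmd"], check=True)  # execute the pulled payload


if __name__ == "__main__":
    run_pilot()
```

The sketch makes the trade-off visible: the pilot masks site instabilities by occupying a batch slot before the payload is chosen, but the site scheduler never learns the payload's actual resource requirements, which is precisely the limitation the aCT/ARC-CE model addresses.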