Abstract
The ATLAS experiment has successfully integrated High-Performance Computing (HPC) resources in its production system. Unlike the current generation of HPC systems, and the LHC computing grid, the next generation of supercomputers is expected to be extremely heterogeneous in nature: different systems will have radically different architectures, and most of them will provide partitions optimized for different kinds of workloads. In this work we explore the applicability of concepts and tools realized in Ray (the high-performance distributed execution framework targeting large-scale machine learning applications) to ATLAS event throughput optimization on heterogeneous distributed resources, ranging from traditional grid clusters to Exascale computers. We present a prototype of Raythena, a Ray-based implementation of the ATLAS Event Service (AES), a fine-grained event processing workflow aimed at improving the efficiency of ATLAS workflows on opportunistic resources, specifically HPCs. The AES is implemented as an event processing task farm that distributes packets of events to several worker processes running on multiple nodes. Each worker in the task farm runs an event-processing application (Athena) as a daemon. The whole system is orchestrated by Ray, which assigns work in a distributed, possibly heterogeneous, environment. For all its flexibility, the AES implementation currently comprises multiple separate layers that communicate through ad-hoc command-line and file-based interfaces. The goal of Raythena is to integrate these layers through a feature-rich, efficient application framework. Besides increasing usability and robustness, a vertically integrated scheduler will enable us to explore advanced concepts such as dynamic shaping of workflows to exploit currently available resources, particularly on heterogeneous systems.
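As a rough illustration of the task-farm pattern described above, the following Python sketch uses Ray actors to stand in for per-node workers and a simple driver loop to keep each actor supplied with packets of events. The names used here (AthenaWorker, event_ranges, process) are illustrative placeholders and do not correspond to the actual Raythena code; only standard Ray primitives (ray.remote, ray.wait, ray.get) are assumed.

import ray

ray.init()  # connect to the (possibly multi-node) Ray cluster

@ray.remote
class AthenaWorker:
    """Hypothetical per-node actor standing in for a worker that drives a
    long-running Athena event-processing daemon."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        # In the real system the worker would start or attach to an Athena
        # daemon here; this sketch only records an identifier.

    def process(self, event_range):
        # Placeholder for forwarding the event range to Athena and
        # collecting the per-event outputs.
        return {"worker": self.worker_id, "events": event_range, "status": "done"}


def event_ranges(n_packets, packet_size):
    """Hypothetical generator of event 'packets' (ranges of event numbers)."""
    for i in range(n_packets):
        yield list(range(i * packet_size, (i + 1) * packet_size))


workers = [AthenaWorker.remote(i) for i in range(4)]
ranges = event_ranges(n_packets=16, packet_size=10)

# Seed every worker with an initial packet of events.
in_flight = {w.process.remote(next(ranges)): w for w in workers}

results = []
for packet in ranges:
    # Wait for any worker to finish, record its result, hand it the next packet.
    done, _ = ray.wait(list(in_flight), num_returns=1)
    worker = in_flight.pop(done[0])
    results.append(ray.get(done[0]))
    in_flight[worker.process.remote(packet)] = worker

# Drain the remaining in-flight tasks.
results.extend(ray.get(list(in_flight)))
print(f"processed {len(results)} packets of events")

The driver keeps every actor busy by reassigning work as soon as a result arrives, which is the essence of the task-farm scheduling that Ray provides across heterogeneous nodes.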
Highlights
Efficient data processing is of key importance in the ATLAS experiment
Unlike the current generation of High-Performance Computing (HPC) systems, the next generation of supercomputers is expected to be extremely heterogeneous in nature: different systems will have radically different architectures, and most of them will provide partitions optimized for different kinds of workloads (e.g. NERSC's Cori GPU nodes)
The Event Service model is suitable for opportunistic resources such as HPCs, where nodes can be allocated only for a limited amount of time and the most efficient way to use this allocation is to process as many events as possible rather than a pre-determined amount
Summary
Efficient data processing is of key importance in the ATLAS experiment. For the large majority of its offline data processing, the ATLAS experiment uses the Worldwide LHC Computing Grid (WLCG). For Run 2 data processing on HPCs, the 'Yoda/Droid' workflow [4] was used, and only the ATLAS Geant simulation tasks were handled. The advantage of the Event Service (ES) is that the number of input events does not need to be known in advance and that the output is generated on an event-by-event basis. This is suitable for opportunistic resources such as HPCs, where nodes can only be allocated for a certain amount of time (typically a few hours) and the most efficient way to use this allocation is to process as many events as possible rather than a pre-determined amount. A typical 'Yoda/Droid' job at NERSC's Cori HPC for Run 2 ATLAS event simulation used over 100 KNL nodes, with a 136-process AthenaMP application running on each node. Nodes have an outbound network connection, and inter-node communication is supported with MPI and TCP/IP.
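As a minimal sketch of the Event Service idea outlined above, the following Python snippet keeps requesting and processing event ranges until the batch allocation is close to expiring, rather than committing to a fixed number of events up front. The get_next_event_range and process_event_range helpers are hypothetical placeholders for the interaction with Athena and the central event service; the timing logic is the point being illustrated.

import time

def run_event_service(get_next_event_range, process_event_range,
                      allocation_seconds, safety_margin=300):
    """Sketch of Event Service scheduling: pull and process event ranges
    until the node allocation is about to expire, instead of processing a
    pre-determined number of events."""
    start = time.monotonic()
    processed = 0
    while time.monotonic() - start < allocation_seconds - safety_margin:
        event_range = get_next_event_range()   # e.g. fetched from a central service
        if event_range is None:                # no more work available
            break
        process_event_range(event_range)       # output is written per event range
        processed += 1
    return processed

Because output is produced event by event, any events completed before the allocation ends are preserved, which is what makes this model efficient on time-limited, opportunistic HPC allocations.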