PanDA and RADICAL-Pilot Integration: Enabling the Pilot Paradigm on HPC Resources

Andre Merzky, Matteo Turilli, Pavlo Svirin, A Forti, L Betev, P Hristov, M Litmaath, O Smirnova

doi:10.1051/epjconf/201921403057

Abstract

PanDA executes millions of ATLAS jobs a month on Grid systems with more than 300,000 cores. Currently, PanDA is compatible with only a few high-performance computing (HPC) resources due to differing edge services and operational policies; it does not implement the pilot paradigm on HPC; and it does not dynamically optimize resource allocation among queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP) system to overcome these limitations and enable the execution of ATLAS, Molecular Dynamics and other workloads on HPC resources. This paper offers two main contributions: (1) introducing PanDA Harvester and RADICAL-Pilot, two systems independently developed to support high-throughput computing (HTC) on HPC infrastructures; (2) describing the integration between these two systems to produce a middleware component with unique functionalities, including the concurrent execution of heterogeneous workloads on the Titan OLCF machine. We integrated Harvester and RP by prototyping a Next Generation Executor (NGE) to expose RP capabilities and manage the execution of PanDA workloads. In this way, we minimized the reengineering of the two systems, allowing their integration while both remained in production.

Highlights

Production ANd Distributed Analysis (PanDA) [1] is the Workload Management System (WMS) [2] used by the ATLAS experiment at the Large Hadron Collider (LHC) to execute scientific applications on widely distributed resources.
Pilots are relevant for LHC experiments, where millions of tasks are executed across multiple sites every month, analyzing and producing petabytes of data.
We present the integration between Harvester (Section 2) and the Next Generation Executor (NGE) [9] (Section 3).

Summary

Introduction

Production ANd Distributed Analysis (PanDA) [1] is the Workload Management System (WMS) [2] used by the ATLAS experiment at the Large Hadron Collider (LHC) to execute scientific applications on widely distributed resources. Unlike Harvester, NGE can submit a pilot job via Titan's batch system and then schedule tasks directly on the acquired resources, without queuing each task on the machine's batch system. In this way, tasks can be executed immediately while respecting the policies of the HPC machine. RADICAL-Pilot enables the concurrent execution of different tasks on central processing units (CPUs) and graphics processing units (GPUs), allowing for the full utilization of the resources of HPC worker nodes. These capabilities will enable PanDA to transition from a workload management system designed to support the execution of ATLAS workloads to a system for the execution of general-purpose workloads on HPC machines.
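The pilot idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the RADICAL-Pilot or NGE implementation: the `Task`, `Pilot` and `schedule` names, the first-fit placement policy, the task names and the resource sizes (16 cores, 1 GPU) are all hypothetical and chosen for illustration. A pilot stands for resources acquired through a single batch submission; tasks are then placed on its free cores and GPUs without going back to the batch queue.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Task:
    """A unit of work with its resource request (hypothetical model)."""
    name: str
    cores: int = 0
    gpus: int = 0


@dataclass
class Pilot:
    """Resources held by one placeholder job after it clears the batch queue."""
    cores: int
    gpus: int
    running: List[Task] = field(default_factory=list)

    def free(self) -> Tuple[int, int]:
        """Return (free cores, free GPUs) not used by running tasks."""
        used_cores = sum(t.cores for t in self.running)
        used_gpus = sum(t.gpus for t in self.running)
        return self.cores - used_cores, self.gpus - used_gpus

    def try_start(self, task: Task) -> bool:
        """Start the task if it fits in the free resources; no batch queue involved."""
        free_cores, free_gpus = self.free()
        if task.cores <= free_cores and task.gpus <= free_gpus:
            self.running.append(task)
            return True
        return False

    def finish(self, task: Task) -> None:
        """Release the resources of a completed task."""
        self.running.remove(task)


def schedule(pilot: Pilot, tasks: List[Task]) -> List[Task]:
    """First-fit placement (an assumed policy); returns the tasks left waiting."""
    waiting = []
    for task in tasks:
        if not pilot.try_start(task):
            waiting.append(task)
    return waiting


# Illustrative sizes and task names only.
pilot = Pilot(cores=16, gpus=1)
tasks = [
    Task("atlas_sim", cores=12),        # CPU-only task
    Task("md_gpu", cores=2, gpus=1),    # task that also needs the GPU
    Task("analysis", cores=4),          # does not fit until cores are released
]
waiting = schedule(pilot, tasks)
print("running:", [t.name for t in pilot.running])
print("waiting:", [t.name for t in waiting])

# When a task completes, the pilot reuses the freed cores immediately.
pilot.finish(tasks[0])
waiting = schedule(pilot, waiting)
print("running:", [t.name for t in pilot.running])
```

In this model the CPU-only and GPU tasks share the pilot's resources concurrently, and the waiting task starts as soon as cores are released, which mirrors the late-binding behaviour that the pilot paradigm provides.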

Harvester

RADICAL-Pilot and Next Generation Executor

Integrating Harvester and NGE

Conclusion and Next Steps


Journal: EPJ Web of Conferences	Publication Date: Jan 1, 2019
License type: CC BY 4.0

