Abstract
PanDA executes millions of ATLAS jobs a month on Grid systems with more than 300,000 cores. Currently, PanDA is compatible only with few high-performance computing (HPC) resources due to different edge services and operational policies; does not implement the pilot paradigm on HPC; and does not dynamically optimize resource allocation among queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP) system to overcome these limitations and enable the execution of ATLAS, Molecular Dy-namics and other workloads on HPC resources. This paper offer two main con-tributions: (1) introducing PanDA Harvester and RADICAL-Pilot, two systems independent developed to support high-throughput computing (HTC) on high-performance computing (HPC) infrastructures; (2) describing the integration between these two systems to produce a middleware component with unique functionalities, including the concurrent execution of heterogeneous workloads on the Titan OLCF machine. We integrated Harvester and RP by prototyping a Next Generation Executor (NGE) to expose RP capabilities and manage the execution of PanDA workloads. In this way, we minimized the reengineering of the two systems, allowing their integration while being in production.
Highlights
Production ANd Distributed Analysis (PanDA) [1] is the Workload Management System (WMS) [2] used by the ATLAS experiment at the Large Hadron Collider (LHC) to execute scientific applications on widely distributed resources
Pilots are relevant for LHC experiments, where millions of tasks are executed across multiple sites every month, analyzing and producing petabytes of data
We present the integration between Harvester (Section 2) and Generation Executor (NGE) [9] (Section 3)
Summary
Production ANd Distributed Analysis (PanDA) [1] is the Workload Management System (WMS) [2] used by the ATLAS experiment at the Large Hadron Collider (LHC) to execute scientific applications on widely distributed resources. Different from Harvester, NGE enables to submit a pilot job via the batch system of Titan and directly schedule tasks on the acquired resources without queuing on the machine batch system. In this way, tasks can be executed immediately while respecting the policies of the HPC machine. RADICAL-Pilot enables the concurrent execution of different tasks on Central processing units (CPU) and Graphics Processing Units (GPU), allowing for the full utilization of HPC worker nodes resources These capabilities will enable PanDA to transition from a workload management system designed to support the execution of ATLAS workloads, to a system for the execution of general purpose workloads on HPC machines
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.