Integration of Titan supercomputer at OLCF with ATLAS Production System

F Barreiro Megino, S Panitkin, D Oleynik, P Nilsson, A Klimentov, S Jha, T Wenaus, S Padolski, J Wells, K De, T Maeno

doi:10.1088/1742-6596/898/9/092002

Abstract

The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of Petabytes of data, and the rate of data processing already exceeds an Exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide. Additional computing and storage resources are required. Therefore ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. In this paper we describe a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA Pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan's multi-core worker nodes. It allows standard ATLAS production jobs to run on otherwise unused resources (backfill) on Titan. The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan's utilization efficiency. We discuss the details of the implementation, current experience with running the system, and future plans aimed at improving scalability and efficiency.

Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
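The abstract names two mechanisms without showing them. The first is sizing a job so that it fits into Titan's currently unused (backfill) nodes. The second is a lightweight MPI wrapper that runs independent single-node payloads, one per node. The Python sketch below illustrates both, under explicit assumptions:

- Every class and function name is hypothetical. None of them is taken from the PanDA Pilot.
- The backfill windows are assumed to have already been parsed from the batch scheduler's report of idle nodes.
- `mpi4py` is used only if it is installed.
- The figure of 16 cores per node reflects the single 16-core AMD Opteron CPU on each Titan compute node.

```python
"""Hypothetical sketch: fitting a job into backfill and fanning out
single-node payloads with an MPI wrapper. Names are illustrative only."""
import subprocess
from dataclasses import dataclass
from typing import List, Optional, Sequence

CORES_PER_NODE = 16  # one 16-core AMD Opteron CPU per Titan compute node


@dataclass
class BackfillWindow:
    nodes: int       # idle nodes currently available (assumed pre-parsed)
    walltime_s: int  # how long those nodes remain idle, in seconds


@dataclass
class JobPlan:
    nodes: int
    walltime_s: int

    @property
    def core_hours(self) -> float:
        # Core-hours charged if the whole allocation runs for its walltime.
        return self.nodes * CORES_PER_NODE * self.walltime_s / 3600


def plan_backfill_job(windows: Sequence[BackfillWindow], min_walltime_s: int,
                      max_nodes: int) -> Optional[JobPlan]:
    """Return the plan with the most core-hours, capped at max_nodes.

    Windows shorter than min_walltime_s (the time one payload needs) are
    skipped, because payloads started there could not finish.
    """
    best: Optional[JobPlan] = None
    for w in windows:
        if w.walltime_s < min_walltime_s or w.nodes <= 0:
            continue
        plan = JobPlan(nodes=min(w.nodes, max_nodes), walltime_s=w.walltime_s)
        if best is None or plan.core_hours > best.core_hours:
            best = plan
    return best


def payload_for_rank(rank: int,
                     payloads: Sequence[List[str]]) -> Optional[List[str]]:
    """One MPI rank per node: rank i runs the i-th independent payload."""
    return list(payloads[rank]) if rank < len(payloads) else None


def main(payloads: Sequence[List[str]]) -> int:
    try:
        from mpi4py import MPI  # optional; only needed when launched under MPI
        rank = MPI.COMM_WORLD.Get_rank()
    except ImportError:
        rank = 0  # serial fallback for local testing
    cmd = payload_for_rank(rank, payloads)
    if cmd is None:
        return 0  # more ranks than payloads: this rank has nothing to run
    # Each payload is an ordinary single-node program; ranks never communicate.
    return subprocess.run(cmd).returncode
```

For example, among the windows (300 nodes, 2 h) and (50 nodes, 10 h), the first wins: 300 × 16 × 2 = 9,600 core-hours, against 50 × 16 × 10 = 8,000. On a Cray system such a wrapper would typically be started with one rank per node, for instance `aprun -n <nodes> -N 1 python wrapper.py`. This launch line is an assumption about deployment, not a detail from the paper. Because the ranks only use MPI to learn their index, each node's payload stays an unmodified single-node job.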

Highlights

The ATLAS experiment [1] is one of the four major experiments at the Large Hadron Collider (LHC).
In this paper we describe a project aimed at integrating the ATLAS production system with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF).
More details about the ATLAS production system can be found in Ref. [6].

Summary

Introduction

The ATLAS experiment [1] is one of the four major experiments at the Large Hadron Collider (LHC). It is designed to test predictions of the Standard Model and to explore the fundamental building blocks of matter and their interactions, as well as novel physics at the highest energy available in the laboratory. To achieve its scientific goals, ATLAS employs a massive computing infrastructure: it currently uses more than 250,000 CPU cores deployed in a global Grid [2, 3]. In this paper we describe a project aimed at integrating the ATLAS production system with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF).

PanDA workload management system

ATLAS production system

Titan at OLCF

Integration with Titan

Running ATLAS production on Titan

Findings

Conclusion

Journal: Journal of Physics: Conference Series	Publication Date: Oct 1, 2017
Citations: 5	License type: cc-by


Similar Papers

Integration of PanDA workload management system with Titan supercomputer at OLCF
K De ... A Klimentov
Journal of Physics: Conference Series | VOL. 664
01 Dec 2015

Integration of Panda Workload Management System with supercomputers
K De ... T Maeno
Physics of Particles and Nuclei Letters | VOL. 13
01 Sep 2016

PanDA Workload Management System Meta-data Segmentation
M Golosova ... E Ryabinkin
Procedia Computer Science | VOL. 66
01 Jan 2015

The future of PanDA in ATLAS distributed computing
K De ... D Oleynik
Journal of Physics: Conference Series | VOL. 664
01 Dec 2015
