Abstract

The ATLAS experiment at the LHC relies on a complex and distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data. The Event Filter (EF) component of the TDAQ system is responsible for executing advanced selection algorithms, reducing the data rate to a level suitable for recording to permanent storage. The EF functionality is provided by a computing farm made up of thousands of commodity servers, each executing one or more processes. Moving the EF farm management towards a solution based on software containers is one of the main themes of the ATLAS TDAQ Phase-II upgrades in the area of the online software; it would open new possibilities for fault tolerance, reliability and scalability. This paper presents the results of an evaluation of Kubernetes as a possible orchestrator of the ATLAS TDAQ EF computing farm. Kubernetes is a system for advanced management of containerized applications in large clusters. This paper will first highlight some of the technical solutions adopted to run the offline version of today’s EF software in a Docker container. Then it will focus on scaling performance measurements executed with a cluster of 1000 CPU cores. In particular, this paper will report on the way Kubernetes scales in deploying containers as a function of the cluster size and show how a proper tuning of the Query per Second (QPS) Kubernetes parameter set can improve the scaling of applications in terms of running replicas. Finally, an assessment will be given of the possibility of using Kubernetes as an orchestrator of the EF computing farm in LHC’s Run 4.
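
To make the kind of measurement described above concrete, the following is a minimal sketch, using the official Kubernetes Python client, of how such a scaling test could be driven: a Deployment with a given number of replicas of a containerized EF processing unit is created, and the time until all replicas report available is measured. The image name, namespace, replica count and object names are illustrative assumptions, not taken from the paper.

    import time
    from kubernetes import client, config

    # Illustrative parameters -- not taken from the paper.
    IMAGE = "registry.example.cern.ch/ef-pu:latest"  # placeholder EF PU image
    REPLICAS = 1000
    NAME = "ef-pu-scaling"

    config.load_kube_config()
    apps = client.AppsV1Api()

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name=NAME),
        spec=client.V1DeploymentSpec(
            replicas=REPLICAS,
            selector=client.V1LabelSelector(match_labels={"app": NAME}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": NAME}),
                spec=client.V1PodSpec(
                    containers=[client.V1Container(name="ef-pu", image=IMAGE)]
                ),
            ),
        ),
    )

    # Create the Deployment and time how long the full rollout takes.
    start = time.time()
    apps.create_namespaced_deployment(namespace="default", body=deployment)
    while True:
        status = apps.read_namespaced_deployment(name=NAME, namespace="default").status
        if (status.available_replicas or 0) >= REPLICAS:
            break
        time.sleep(1)
    print(f"{REPLICAS} replicas available after {time.time() - start:.1f} s")

The QPS tuning mentioned in the abstract most likely concerns the client-side API rate limits of the Kubernetes components (for example the --kube-api-qps and --kube-api-burst flags of the kubelet, scheduler and controller manager), which bound how quickly pods can be scheduled and started; the exact parameters tuned are detailed in the paper itself.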

Highlights

  • During Run 2, the Large Hadron Collider (LHC) [1] operated at a centre-of-mass energy of 13 TeV, with a peak luminosity of about 2.0 × 10³⁴ cm⁻² s⁻¹ and more than 60 interactions per bunch crossing (from ATL-DAQ-PROC-2018-022)

  • In order to minimize the impact of the started applications on the measurement, a pause container was used and its image was pre-pulled into the cluster (see the sketch after this list)

  • Assuming no higher-order effects with larger clusters (Kubernetes officially supports clusters of up to 5000 hosts), an Event Filter (EF) processing unit (PU) service instance can be fully deployed on each node of a 3000-host cluster in about 35 seconds (Figure 4), matching the corresponding Run 2 performance figures after a proper choice of the Query per Second (QPS) values

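The highlights above describe one lightweight container per node with its image pre-pulled; a natural way to express this in Kubernetes is a DaemonSet running the pause image, sketched below with the Python client. The paper does not specify the exact Kubernetes objects used, and the image tag and names here are assumptions.

    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    # One pause pod per schedulable node; the tiny pause image keeps container
    # start-up overhead negligible. Image tag and names are illustrative.
    daemonset = client.V1DaemonSet(
        metadata=client.V1ObjectMeta(name="pause-per-node"),
        spec=client.V1DaemonSetSpec(
            selector=client.V1LabelSelector(match_labels={"app": "pause-per-node"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "pause-per-node"}),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(name="pause", image="k8s.gcr.io/pause:3.1")
                    ]
                ),
            ),
        ),
    )
    apps.create_namespaced_daemon_set(namespace="default", body=daemonset)

    # Check how many nodes are running the pod (and hence have the image cached).
    status = apps.read_namespaced_daemon_set(
        name="pause-per-node", namespace="default"
    ).status
    print(f"{status.number_ready}/{status.desired_number_scheduled} nodes ready")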

Summary

Introduction

During Run 2, the Large Hadron Collider (LHC) [1] operated at a centre-of-mass energy of 13 TeV, with a peak luminosity of about 2.0 × 10³⁴ cm⁻² s⁻¹ and more than 60 interactions per bunch crossing. For LHC Run 4, the TDAQ system will be upgraded as part of the ATLAS Phase-II programme: it will sustain an input rate of 1 MHz (10 times more than in Run 2) with an average event size of about 5 MB (4 times more than in Run 2). It will include a large IT infrastructure, with thousands of computing nodes and applications to supervise. The following sections focus on the evaluation of a possible candidate to orchestrate the EF computing farm operations.
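
As a back-of-the-envelope check derived from the figures just quoted, the implied aggregate input bandwidth into the Event Filter is of the order of a few terabytes per second:

    \[
      1~\mathrm{MHz} \times 5~\mathrm{MB/event} = 5 \times 10^{6}~\mathrm{MB/s} \approx 5~\mathrm{TB/s}
    \]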

Event Filter farm orchestration
Event Filter processing units in software containers
Performance and scaling
Conclusions
