Abstract

The ATLAS detector at CERN records proton-proton collisions delivered by the Large Hadron Collider (LHC). The ATLAS Trigger and Data-Acquisition (TDAQ) system identifies, selects, and stores interesting collision data. These are received from the detector readout electronics at an average rate of 100 kHz, with a typical event data size of 1 to 2 MB. Overall, the ATLAS TDAQ system can be seen as a distributed software system executed on a farm of roughly 2000 commodity PCs. The worker nodes are interconnected by an Ethernet network that, at the restart of the LHC in 2015, is expected to sustain a throughput of several tens of GB/s. A particular challenge posed by this system, and by DAQ systems in general, is the inherently bursty nature of the data traffic from the readout buffers to the worker nodes. This can cause instantaneous network congestion and therefore performance degradation. The effect is particularly pronounced for unreliable network interconnections, such as Ethernet. In this paper we report on the design of the data-flow software for the 2015-2018 data-taking period of the ATLAS experiment. This software is responsible for transporting the data across the distributed Data-Acquisition system. We focus on the strategies employed to manage network congestion and thereby minimise the data-collection latency and maximise the system performance. We discuss the results of systematic measurements performed on different types of networking hardware. These results highlight the causes of network congestion and their effects on the overall system performance.
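For orientation, a back-of-envelope check (ours, not a figure from the paper): reading every accepted event out in full would require roughly

    100 kHz × 1.5 MB ≈ 150 GB/s,

so a sustained network load of several tens of GB/s implies that, on average, only a fraction of each event's data is requested across the network before the trigger decision. This on-demand, selective readout is precisely what makes the request traffic bursty rather than a steady full-rate stream.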

Highlights

  • A particular type of challenge posed by this system, and by DAQ systems in general, is the inherently bursty nature of the data traffic from the readout buffers to the worker nodes

  • The ATLAS detector at CERN records proton-proton collisions delivered by the Large Hadron Collider (LHC)

  • Introduction: the ATLAS Trigger and Data-Acquisition system in 2015–2018. ATLAS [1] is one of the experiments installed at the Large Hadron Collider (LHC) at CERN, Geneva, Switzerland


Summary

Introduction

A particular type of challenge posed by this system, and by DAQ systems in general, is the inherently bursty nature of the data traffic from the readout buffers to the worker nodes. When an event is accepted by the Level-1 trigger, its data fragments (1860 fragments of variable size, around 1 kB each) are distributed, over custom optical links, to hardware buffers in the Readout System (ROS) nodes. Packet drops are prevented with a client-side traffic-shaping algorithm: smoothing the rate of data requests generated by an HLT node can alleviate network congestion by controlling the maximum size of the traffic burst from the ROS, as sketched below.
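(As a consistency check, 1860 fragments of roughly 1 kB each add up to about 1.9 MB per event, matching the 1 to 2 MB event size quoted in the abstract.)

The summary does not spell the algorithm out, so the following is a minimal sketch of one plausible realisation: a credit-based shaper on the HLT node that caps the number of outstanding fragment requests, and hence the size of the many-to-one response burst converging on the node. All names here (CreditShaper, acquire, release, maxOutstanding) are illustrative, not the actual ATLAS TDAQ API.

    // Minimal credit-based traffic-shaping sketch (illustrative, C++11).
    // Before sending a fragment request to a ROS node, the requesting
    // thread must acquire a credit; the credit is returned when the
    // corresponding fragment has arrived. With N credits and ~1 kB
    // fragments, the instantaneous burst converging on this HLT node
    // is bounded by roughly N x 1 kB.
    #include <condition_variable>
    #include <mutex>

    class CreditShaper {
    public:
        explicit CreditShaper(unsigned maxOutstanding)
            : credits_(maxOutstanding) {}

        // Block until a credit is available, then consume it.
        void acquire() {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return credits_ > 0; });
            --credits_;
        }

        // Return a credit once the requested fragment has been
        // received, unblocking the next queued request.
        void release() {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                ++credits_;
            }
            cv_.notify_one();
        }

    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        unsigned credits_;
    };

The trade-off controlled by maxOutstanding is the one the abstract alludes to: too many outstanding requests overflow the switch buffers and cause packet drops on the unreliable Ethernet fabric, while too few leave the links idle and increase the data-collection latency.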

