Abstract

The Data Acquisition (DAQ) system of the Compact Muon Solenoid (CMS) experiment at the LHC is a complex system responsible for the data readout, event building and recording of accepted events. Its proper functioning plays a critical role in the data-taking efficiency of the CMS experiment. In order to ensure high availability and to recover promptly in the event of hardware or software failure of the subsystems, an expert system, the DAQ Expert, has been developed. It aims at improving the data-taking efficiency, reducing human error in the operations and minimising the on-call expert demand. Introduced at the beginning of 2017, it assists the shift crew and the system experts in recovering from operational faults, streamlining post-mortem analysis and, at the end of Run 2, triggering fully automatic recovery without human intervention. DAQ Expert analyses the real-time monitoring data, updated every few seconds, originating from the DAQ components and the high-level trigger. It pinpoints data flow problems and recovers from them automatically or after operator approval. We analyse the CMS downtime in the 2018 run, focusing on what was improved with the introduction of automated recovery, and present the challenges and design of encoding the expert knowledge into automated recovery jobs. Furthermore, we demonstrate the web-based ReactJS interfaces that ensure effective cooperation between the human operators in the control room and the automated recovery system. We report on the operational experience with automated recovery.

Highlights

  • The Compact Muon Solenoid (CMS)[1] Data Acquisition (DAQ) system is responsible for reading out the data from one of the two general purpose experiments at the Large Hadron Collider (LHC)

  • The data-taking efficiency of CMS was 95.87% uptime in 2018, measured as the percentage of system uptime during the total time of Stable Beams delivered by the LHC

  • Automatic recovery is a functionality of the DAQExpert system


Summary

Introduction

The Compact Muon Solenoid (CMS)[1] Data Acquisition (DAQ) system is responsible for reading out the data from one of the two general purpose experiments at the Large Hadron Collider (LHC). The accelerator complex provides proton-proton bunch crossings at a rate of 40 MHz, and the average size of each collision event is 1-2 MB. A two-level trigger is in place in order to select only the most interesting data for storage and further analysis. A hardware trigger selects events at a rate of 100 kHz. Full events are read out and built from all detector electronics, yielding a throughput of 200 GB/s. The High Level Trigger farm of 35 000 cores reduces the event rate to O(1 kHz). In order to minimize the downtime of the system, various recovery procedures have been prepared by the system experts. The operator crew, rotating in the control room 24/7, supervises the data taking and follows recovery procedures if needed. During LHC Run-1 and LHC Run-2, various automation mechanisms were introduced into the system [2, 3].
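The quoted rates and event sizes can be cross-checked with simple arithmetic. The sketch below, using only the figures stated above, shows how the 200 GB/s event-building throughput and the HLT rejection factor follow from them:

```python
# Back-of-the-envelope consistency check of the CMS DAQ figures quoted
# in the text above. All inputs come directly from the Introduction.

L1_ACCEPT_RATE_HZ = 100e3   # hardware (Level-1) trigger accept rate: 100 kHz
EVENT_SIZE_MB = 2.0         # upper end of the 1-2 MB average event size
HLT_OUTPUT_RATE_HZ = 1e3    # HLT output rate, O(1 kHz)

# Event-building throughput = accept rate x event size (MB/s -> GB/s)
throughput_gb_s = L1_ACCEPT_RATE_HZ * EVENT_SIZE_MB / 1000.0
print(f"Event-building throughput: {throughput_gb_s:.0f} GB/s")  # -> 200 GB/s

# The HLT reduces 100 kHz to ~1 kHz: a rejection factor of ~100
rejection = L1_ACCEPT_RATE_HZ / HLT_OUTPUT_RATE_HZ
print(f"HLT rejection factor: ~{rejection:.0f}x")  # -> ~100x
```

This confirms that the 200 GB/s figure corresponds to the 100 kHz accept rate at the upper end of the stated event size range.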

DAQExpert
Impact
The Human Factor
Automatic Recovery
Architecture
First Recovery
Findings
Summary

