Abstract

The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment is a very complex distributed computing system, composed of more than 20000 applications running on more than 2000 computers. The TDAQ Controls system has to guarantee the smooth and synchronous operations of all the TDAQ components and has to provide the means to minimize the downtime of the system caused by runtime failures. During data taking runs, streams of information messages sent or published by running applications are the main sources of knowledge about correctness of running operations. The huge flow of operational monitoring data produced is constantly monitored by experts in order to detect problems or misbehaviours. Given the scale of the system and the rates of data to be analyzed, the automation of the system functionality in the areas of operational monitoring, system verification, error detection and recovery is a strong requirement. To accomplish its objective, the Controls system includes some high-level components which are based on advanced software technologies, namely the rule-based Expert System and the Complex Event Processing engines. The chosen techniques allow to formalize, store and reuse the knowledge of experts and thus to assist the shifters in the ATLAS control room during the data-taking activities.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.