Abstract

The online farm of the ATLAS experiment at the LHC, consisting of nearly 4000 PCs with various characteristics, provides configuration and control of the detector and performs the collection, processing, selection, and conveyance of event data from the front-end electronics to mass storage. Different aspects of the farm management are already accessible via several tools. The status and health of each node are monitored by a system based on Icinga 2 and Ganglia. PuppetDB centrally gathers all status information from Puppet, the configuration management tool used to ensure the configuration consistency of every node. The in-house Configuration Database (ConfDB) controls DHCP and PXE, while also integrating external information sources. In these proceedings we present our roadmap for integrating these and other data sources and systems, and for building a higher level of abstraction on top of this foundation. An automation and orchestration tool will be able to use these systems and replace lengthy manual procedures, some of which also require interactions with other systems and teams, e.g. for the repair of a faulty node. Finally, an inventory and tracking system will complement the available data sources, keep track of node history, and improve the evaluation of long-term lifecycle management and purchase strategies.

Highlights

  • The online farm of the ATLAS [1] experiment at the LHC consists of nearly 4000 nodes with various characteristics

  • The Configuration Database (ConfDB) manages the status and function of each node (TDAQ, Sim@P1 [8], etc.)

  • OKS [6] [12] is a library that provides a simple, active, persistent in-memory object manager. It serves as the basis of the configuration database, providing the overall description of the Data Acquisition (DAQ) system and of the trigger and detector software and hardware


Summary

Introduction

The online farm of the ATLAS [1] experiment at the LHC consists of nearly 4000 nodes with various characteristics. Due to the large scale of the farm and the variety of the systems, appropriate tools are needed to manage [2] and monitor [3] these nodes effectively. This is a time-consuming process, and the expert must remember to update all the tools in the correct order (as per the defined procedures). A procedure may require the expert to constantly monitor the status of the node to determine when it is ready for an intervention, which results in an inefficient workflow.
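
The manual workflow described above, in which an expert watches a node until it is ready and then updates several tools in a fixed order, could in principle be automated along the following lines. This is only a minimal sketch under stated assumptions, not the tool presented in the paper: the function names, the `is_ready` check, and the step names are all hypothetical placeholders, and no real Icinga 2, Puppet, or ConfDB API is called.

```python
import time
from typing import Callable, List, Sequence, Tuple

# A step is a (name, action) pair; the action receives the node name.
# Both are hypothetical placeholders for calls into the real tools.
Step = Tuple[str, Callable[[str], None]]


def wait_until_ready(
    node: str,
    is_ready: Callable[[str], bool],
    poll_interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready(node) until it returns True or the timeout elapses."""
    waited = 0.0
    while True:
        if is_ready(node):
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s


def run_intervention(
    node: str,
    is_ready: Callable[[str], bool],
    steps: Sequence[Step],
    **wait_kwargs,
) -> List[str]:
    """Wait for the node, then run every step in the given order.

    Returns the names of the completed steps (empty if the node never
    became ready). An exception raised by a step propagates and aborts
    the remaining steps.
    """
    completed: List[str] = []
    if not wait_until_ready(node, is_ready, **wait_kwargs):
        return completed
    for name, action in steps:
        action(node)
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Simulated readiness check: not ready twice, then ready.
    answers = iter([False, False, True])
    done = run_intervention(
        "example-node",  # hypothetical node name
        is_ready=lambda n: next(answers),
        steps=[
            ("schedule_downtime", lambda n: None),  # placeholder action
            ("update_configuration", lambda n: None),  # placeholder action
        ],
        poll_interval_s=0.0,
    )
    print(done)
```

Passing the readiness check and the steps in as plain callables keeps the ordering logic independent of any particular monitoring or configuration system, so the same loop can be exercised with simulated checks before it is connected to real services.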

Tools overview
Configuration Database
Monitoring
OKS - Object Kernel Support
Implementation
Schedule Downtime
Results
Inventory and tracking system
Conclusion
