Autonomous System Management for the ALICE High-Level-Trigger Cluster using the SysMES framework

Stefan Boettger, Pierre Zelnicek, Timo Breitner, Udo Kebschull, Jochen Ulrich, Camilo Lara

doi:10.1088/1742-6596/331/5/052003

Open Access

https://doi.org/10.1088/1742-6596/331/5/052003

Abstract

The ALICE HLT cluster is a heterogeneous computer cluster currently consisting of 200 nodes. It will be used for on-line processing of the data produced by the ALICE detector over the next 10 or more years of operation. A major management challenge is to reduce the number of manual interventions required when failures occur. Classical approaches such as monitoring tools lack mechanisms both for detecting situations that involve multiple failure conditions and for reacting to them automatically. We have therefore developed SysMES (System Management for networked Embedded Systems and Clusters), a decentralized, fault-tolerant tool-set for autonomous management. It comprises a monitoring facility for detecting the working states of the distributed resources, a central interface for visualizing and managing the cluster environment, and a rule system that couples the monitoring and management aspects. We have developed a formal language in which an administrator can define complex spatial and temporal conditions for failure states together with the corresponding reactions. For the HLT we have defined a set of rules for known and recurring problem states, so that SysMES takes care of most of the day-to-day administrative work.
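
To make the idea of a rule with spatial and temporal conditions more concrete, the following is a minimal Python sketch. It is not the SysMES formal language or its API, neither of which is shown in this abstract. The names `Event`, `Rule`, `min_nodes`, `window`, the node names and the `fan_failure` event are all hypothetical and chosen only for illustration. The sketch assumes that a spatial condition means "at least N distinct nodes report the same failure" and that a temporal condition means "within the last T seconds"; when both hold, a reaction is executed.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class Event:
    """A monitored failure event (hypothetical structure, not SysMES's)."""
    node: str         # name of the reporting node
    name: str         # kind of failure, e.g. "fan_failure"
    timestamp: float  # time of the report, in seconds


@dataclass
class Rule:
    """Couples a failure condition with a reaction (illustrative only)."""
    event_name: str                      # which failure the rule watches
    min_nodes: int                       # spatial condition: distinct nodes
    window: float                        # temporal condition: seconds
    action: Callable[[Set[str]], None]   # reaction, given the affected nodes

    def evaluate(self, events: List[Event], now: float) -> bool:
        # Collect the distinct nodes that reported the watched failure
        # inside the time window [now - window, now].
        affected = {
            e.node
            for e in events
            if e.name == self.event_name
            and now - self.window <= e.timestamp <= now
        }
        # Run the reaction only if enough distinct nodes are affected.
        if len(affected) >= self.min_nodes:
            self.action(affected)
            return True
        return False


# Usage: react when at least 2 nodes report a fan failure within 60 s.
triggered: List[List[str]] = []
rule = Rule(
    event_name="fan_failure",
    min_nodes=2,
    window=60.0,
    action=lambda nodes: triggered.append(sorted(nodes)),
)
events = [
    Event("node01", "fan_failure", 100.0),
    Event("node02", "fan_failure", 130.0),
    Event("node03", "fan_failure", 10.0),  # too old to count at now=150
]
fired = rule.evaluate(events, now=150.0)
```

In this sketch the reaction merely records the affected nodes. In a real deployment it would stand in for an administrative action, and the conditions would be written in the framework's own rule language rather than in Python.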
