Abstract

The ALICE HLT cluster is a heterogeneous computer cluster currently consisting of 200 nodes. The cluster will be used for on-line processing of data produced by the ALICE detector over the next 10 or more years of operation. A major management challenge is to reduce the number of manual interventions in case of failures. Classical approaches such as monitoring tools lack mechanisms to detect situations with multiple failure conditions and to react to such situations automatically. We have therefore developed SysMES (System Management for networked Embedded Systems and Clusters), a decentralized, fault-tolerant tool set for autonomous management. It comprises a monitoring facility for detecting the working states of the distributed resources, a central interface for visualizing and managing the cluster environment, and a rule system for coupling the monitoring and management aspects. We have developed a formal language in which an administrator can define complex spatial and temporal conditions for failure states and the corresponding reactions. For the HLT we have defined a set of rules for known and recurring problem states, so that SysMES takes care of most of the day-to-day administrative work.
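
To illustrate what a rule combining a spatial and a temporal condition with a reaction might look like, the following is a minimal Python sketch. It is not the SysMES rule language or API, which the abstract does not specify; the names `Event`, `Rule`, `min_nodes`, `window` and `feed` are hypothetical, and the sketch assumes that events arrive in timestamp order. The rule fires its reaction once a given event has been reported by at least `min_nodes` distinct nodes within `window` seconds.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Event:
    """A monitoring event (hypothetical structure, for illustration only)."""
    node: str         # name of the node that reported the event
    name: str         # event type, e.g. "temp_high"
    timestamp: float  # time of the event in seconds


@dataclass
class Rule:
    """Fire `action` when `event_name` is seen on at least `min_nodes`
    distinct nodes (spatial condition) within `window` seconds
    (temporal condition). Assumes events are fed in timestamp order."""
    event_name: str
    min_nodes: int
    window: float
    action: Callable[[List[str]], None]
    _recent: Deque[Event] = field(default_factory=deque)

    def feed(self, event: Event) -> bool:
        """Process one event; return True if the rule fired."""
        if event.name != self.event_name:
            return False
        self._recent.append(event)
        # Temporal condition: discard events older than the window.
        while event.timestamp - self._recent[0].timestamp > self.window:
            self._recent.popleft()
        # Spatial condition: count the distinct nodes still in the window.
        nodes = sorted({e.node for e in self._recent})
        if len(nodes) >= self.min_nodes:
            self.action(nodes)    # the reaction coupled to the condition
            self._recent.clear()  # avoid firing again on the same events
            return True
        return False
```

A rule built as `Rule("temp_high", min_nodes=2, window=60.0, action=handler)` would stay silent while a single node keeps reporting `temp_high`, and would call `handler` with the list of affected nodes as soon as a second node reports the same event within one minute.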

