Abstract
HTCondor has been widely adopted by HEP clusters for its high scheduling performance. Unlike other schedulers, HTCondor manages worker nodes only loosely. We developed a maintenance automation tool, “HTCondor MAT”, that focuses on dynamic resource management and automatic error handling. A central database records the information for every worker node and sends it to each node, where it is used for the startd configuration. When an error occurs on a worker node, the node's record in the database is updated and the node is reconfigured with the new information. The new configuration stops the startd from accepting jobs affected by the error until the worker node recovers. The MAT has been deployed in the IHEP HTC cluster to manage the worker nodes centrally and to remove the impact of worker node errors automatically.
Highlights
The Institute of High Energy Physics (IHEP) in China runs an HTCondor cluster with 17,000 CPU cores that supports more than 10 HEP experiments, including BESIII[1], LHAASO[2], JUNO[3], ATLAS[4], and CMS[5].
We developed a maintenance automation tool called “MAT” that automatically adjusts the worker node attributes so that a node stops accepting jobs that would fail because of an error.
The receiver generates a new configuration file from the “Linux group list” message it receives from the pusher, then reconfigures the “startd” service with that file.
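As a rough illustration of the receiver step, the sketch below turns a received group list into a single START line for the startd. It is not the MAT implementation: the function name `render_startd_config` and the job attribute `JobLinuxGroup` are hypothetical, and the source does not say which job attribute carries the Linux group. `stringListMember` is a standard ClassAd function.

```python
def render_startd_config(groups):
    """Build a START line that admits only jobs whose group is in `groups`.

    `JobLinuxGroup` is a hypothetical job ClassAd attribute used here
    for illustration; the real attribute name is not given in the source.
    """
    if not groups:
        # No group may run here: refuse every job until the list is restored.
        return "START = False\n"
    # Deduplicate and sort so the generated file is stable between pushes.
    group_list = ",".join(sorted(set(groups)))
    return f'START = stringListMember(TARGET.JobLinuxGroup, "{group_list}")\n'


if __name__ == "__main__":
    print(render_startd_config(["juno", "bes", "lhaaso"]), end="")
```

After writing such a line to a local configuration file, a receiver would presumably ask the startd to reread its configuration, for example with `condor_reconfig`; the exact mechanism used by the MAT is not described here.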
Summary
The administrator sets each worker node's “START” attribute to the experiments’ “Linux groups” that the node serves, based on the experiments’ requirements. Most worker nodes are shared by all the experiments, and a small number are dedicated to specific tasks; both cases are handled by adjusting the “START” attribute. An unexpected error on a worker node can turn it into a black hole node, which keeps accepting queued jobs that then fail. We developed a maintenance automation tool called “MAT” that automatically adjusts the worker node attributes so that the node stops accepting jobs that would fail because of the error.
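The error-handling idea can be sketched as a small filter over a node's group list: groups whose jobs would fail under a currently active error are dropped, and they return once the error clears. This is a minimal sketch under stated assumptions, not the MAT code. The error names and the `ERROR_AFFECTED_GROUPS` mapping are hypothetical, since the source does not list which errors affect which groups.

```python
# Hypothetical mapping from an error type to the groups whose jobs would fail.
ERROR_AFFECTED_GROUPS = {
    "juno_fs_unmounted": {"juno"},
    "bes_fs_unmounted": {"bes"},
}


def effective_groups(configured, active_errors, affected=ERROR_AFFECTED_GROUPS):
    """Return the groups a node may still accept, in their configured order.

    configured    -- groups the administrator assigned to the node
    active_errors -- error names currently detected on the node
    affected      -- mapping from error name to the set of groups it breaks
    """
    blocked = set()
    for err in active_errors:
        # Errors missing from the mapping block nothing in this sketch.
        blocked |= affected.get(err, set())
    return [g for g in configured if g not in blocked]


if __name__ == "__main__":
    node_groups = ["bes", "juno", "lhaaso"]
    print(effective_groups(node_groups, ["juno_fs_unmounted"]))
    print(effective_groups(node_groups, []))  # error cleared: full list again
```

Because the administrator's list is kept separate from the active errors, recovery needs no special case: once an error is removed from `active_errors`, the next push restores the original group list.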