Abstract
HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. In 2017 a Slurm cluster was set up to run High Performance Computing (HPC) jobs. To provide accounting services for these two clusters, we implemented a unified accounting system named Cosmos. The different workloads bring different accounting requirements. In brief, four types of jobs must be accounted for. First, 30 million single-core jobs run in the HTCondor cluster every year. Second, Virtual Machine (VM) jobs run in the legacy HTCondor VM cluster. Third, parallel jobs run in the Slurm cluster, and some of these jobs run on GPU worker nodes to accelerate computing. Last, some selected HTC jobs are migrated from the HTCondor cluster to the Slurm cluster for research purposes. To satisfy all of these requirements, Cosmos is implemented with four layers: acquisition, integration, statistics and presentation. The issues and solutions of each layer are presented in detail in the paper. Cosmos has run in production for two years, and its operational status shows that it functions well and meets the requirements of both the HTCondor and Slurm clusters.
Highlights
HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016
Parallel jobs run in the Slurm cluster, and some of these jobs are run on the GPU worker nodes to accelerate computing
Retrieving job accounting information from history files is not fast enough, nor does Slurm account for jobs in the way we prefer. Because both the HTCondor and Slurm clusters are managed by one administrator group, a unified accounting system would make the clusters more convenient to manage
Summary
HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. A Slurm cluster was established to manage High Performance Computing (HPC) jobs. Although both the HTCondor and Slurm clusters provide native job accounting services, these are not enough to meet our demands. Retrieving job accounting information from history files is not fast enough, nor does Slurm account for jobs in the way we prefer. Because both the HTCondor and Slurm clusters are managed by one administrator group, a unified accounting system would make the clusters more convenient to manage. Cosmos is implemented as a four-layer system: acquisition, integration, statistics and presentation. These four layers work together to generate monthly accounting invoices for users and groups, and to let administrators check the status of the clusters.
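The four-layer split described above can be illustrated with a minimal sketch. This is not the Cosmos implementation: the record schema, the raw field names, the sample data, and the choice of core-hours as the billed quantity are all assumptions made for illustration. The sketch only shows how per-cluster records might be normalized into one format before monthly totals are computed and rendered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """Unified job record (hypothetical schema, not the Cosmos schema)."""
    cluster: str
    user: str
    group: str
    month: str          # "YYYY-MM"
    cores: int
    wall_seconds: int


def acquire():
    """Acquisition layer: return raw records per cluster.

    The data is hard-coded and the keys are illustrative; they are not
    real HTCondor or Slurm attribute names.
    """
    htcondor_raw = [
        {"owner": "alice", "grp": "physics", "month": "2019-03", "cpus": 1, "wall": 7200},
        {"owner": "alice", "grp": "physics", "month": "2019-03", "cpus": 1, "wall": 3600},
    ]
    slurm_raw = [
        {"user": "bob", "account": "lattice", "month": "2019-03", "ncpus": 36, "elapsed": 1800},
    ]
    return htcondor_raw, slurm_raw


def integrate(htcondor_raw, slurm_raw):
    """Integration layer: map both raw formats onto the unified record."""
    records = [
        JobRecord("htcondor", r["owner"], r["grp"], r["month"], r["cpus"], r["wall"])
        for r in htcondor_raw
    ]
    records += [
        JobRecord("slurm", r["user"], r["account"], r["month"], r["ncpus"], r["elapsed"])
        for r in slurm_raw
    ]
    return records


def summarize(records):
    """Statistics layer: core-hours per (group, user, month)."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec.group, rec.user, rec.month)] += rec.cores * rec.wall_seconds / 3600
    return dict(totals)


def present(totals):
    """Presentation layer: one invoice line per (group, user, month)."""
    return [
        f"{month} {group}/{user}: {hours:.1f} core-hours"
        for (group, user, month), hours in sorted(totals.items())
    ]


if __name__ == "__main__":
    for line in present(summarize(integrate(*acquire()))):
        print(line)
```

Keeping the cluster-specific parsing inside the integration step means the statistics and presentation steps never need to know which scheduler a job came from, which is one plausible reason for separating the layers this way.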