Abstract

HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. In 2017 a Slurm cluster was set up to run High Performance Computing (HPC) jobs. To provide accounting services for these two clusters, we implemented a unified accounting system named Cosmos. Different workloads bring different accounting requirements. In brief, four types of jobs must be accounted for. First, 30 million single-core jobs run in the HTCondor cluster every year. Second, Virtual Machine (VM) jobs run in the legacy HTCondor VM cluster. Third, parallel jobs run in the Slurm cluster, and some of them run on GPU worker nodes to accelerate computing. Finally, some selected HTC jobs are migrated from the HTCondor cluster to the Slurm cluster for research purposes. To satisfy all of these requirements, Cosmos is implemented with four layers: acquisition, integration, statistics and presentation. The issues and solutions of each layer are presented in detail in the paper. Cosmos has run in production for two years, and its status shows that it functions well and meets the requirements of both the HTCondor and Slurm clusters.


Summary

Introduction

HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. A Slurm cluster was established to manage High Performance Computing (HPC) jobs. Although both the HTCondor and Slurm clusters provide native job accounting services, these are not enough to meet our demands: retrieving job accounting information from history files is not fast enough, and Slurm does not account for jobs in the way we prefer. Because both clusters are managed by one administrator group, a unified accounting system would make them more convenient to manage. Cosmos is implemented as a four-layer system: acquisition, integration, statistics and presentation. These four layers work together to generate monthly accounting invoices for users and groups, and to let administrators check the status of the clusters.
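To make the four-layer idea concrete, the following is a minimal sketch, not the Cosmos implementation. It assumes the acquisition layer hands over raw per-job dictionaries; the keys used here loosely follow HTCondor job ClassAd attributes and Slurm `sacct` fields, but which fields Cosmos actually collects is not stated in this summary. The `JobRecord` schema, the function names, and the core-hour metric are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class JobRecord:
    """Unified job record (hypothetical schema, not the one used by Cosmos)."""
    cluster: str
    user: str
    group: str
    cores: int
    start: datetime
    end: datetime

    @property
    def core_hours(self) -> float:
        # Wall-clock duration multiplied by the number of allocated cores.
        return self.cores * (self.end - self.start).total_seconds() / 3600.0


# Integration layer: map each scheduler's raw record onto the unified schema.
def from_htcondor(raw: dict) -> JobRecord:
    # Assumed input: epoch-second timestamps, as in HTCondor job ads.
    return JobRecord(
        cluster="htcondor",
        user=raw["Owner"],
        group=raw["AcctGroup"],
        cores=int(raw["RequestCpus"]),
        start=datetime.fromtimestamp(raw["JobStartDate"], tz=timezone.utc),
        end=datetime.fromtimestamp(raw["CompletionDate"], tz=timezone.utc),
    )


def from_slurm(raw: dict) -> JobRecord:
    # Assumed input: ISO-style timestamp strings, treated here as UTC.
    def parse(ts: str) -> datetime:
        return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)

    return JobRecord(
        cluster="slurm",
        user=raw["User"],
        group=raw["Account"],
        cores=int(raw["AllocCPUS"]),
        start=parse(raw["Start"]),
        end=parse(raw["End"]),
    )


# Statistics layer: sum core-hours per (month, group, user).
def monthly_usage(records: list[JobRecord]) -> dict[tuple[str, str, str], float]:
    usage: dict[tuple[str, str, str], float] = defaultdict(float)
    for rec in records:
        # A job is billed to the month in which it finished (an assumption).
        month = rec.end.strftime("%Y-%m")
        usage[(month, rec.group, rec.user)] += rec.core_hours
    return dict(usage)


# Presentation layer: render the aggregated usage as plain-text invoice lines.
def format_invoice(usage: dict[tuple[str, str, str], float]) -> list[str]:
    return [
        f"{month} {group}/{user}: {hours:.1f} core-hours"
        for (month, group, user), hours in sorted(usage.items())
    ]
```

Normalizing both schedulers into one record type before aggregating is what lets a single statistics step serve both clusters; a real system would also need to handle GPU usage, VM jobs, and jobs that span a month boundary, none of which this sketch attempts.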

Related Works
Design and Implementation
The Integration Layer
The Statistical Layer
The Presentation Layer
System Status
Conclusion