Cosmos : A Unified Accounting System both for the HTCondor and Slurm Clusters at IHEP

Ran Du, Jingyan Shi, Jiaheng Zou, Xiaowei Jiang, C Doglioni, D Kim, P Jackson, G.A Stewart, W Kamleh, L Silvestris

doi:10.1051/epjconf/202024507060

Open Access

Abstract

HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. In 2017, a Slurm cluster was set up to run High Performance Computing (HPC) jobs. To provide accounting services for these two clusters, we implemented a unified accounting system named Cosmos. The different workloads bring different accounting requirements. In brief, there are four types of jobs to account for. First, 30 million single-core jobs run in the HTCondor cluster every year. Second, Virtual Machine (VM) jobs run in the legacy HTCondor VM cluster. Third, parallel jobs run in the Slurm cluster, and some of them run on GPU worker nodes to accelerate computing. Finally, selected HTC jobs are migrated from the HTCondor cluster to the Slurm cluster for research purposes. To satisfy all of these requirements, Cosmos is implemented with four layers: acquisition, integration, statistics and presentation. The issues and solutions of each layer are presented in the paper. Cosmos has run in production for two years, and its status shows that it is a well-functioning system that meets the requirements of both the HTCondor and Slurm clusters.

Summary

Introduction

HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. A Slurm cluster was established to manage High Performance Computing (HPC) jobs. Although both the HTCondor and Slurm clusters provide native job accounting services, these are not enough to meet our demands: getting job accounting information from history files is not fast enough, and Slurm does not account for jobs in the way we prefer. Because both clusters are managed by one administrator group, a unified accounting system would also make cluster management more convenient. Cosmos is implemented as a four-layer system: acquisition, integration, statistics and presentation. These four layers work together to generate monthly accounting invoices for users and groups, and to let administrators check the status of the clusters.
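The four-layer flow described above can be illustrated with a minimal Python sketch. This is not the Cosmos implementation: the `JobRecord` schema, the function names, the choice of core-hours as the billed quantity, and the sample records are all assumptions made for illustration. The input keys mimic HTCondor job ClassAd attributes (`Owner`, `RequestCpus`, `RemoteWallClockTime`, `CompletionDate`) and Slurm `sacct` fields (`User`, `Account`, `AllocCPUS`, `ElapsedRaw`, `End`); the summary does not say which fields Cosmos actually reads.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class JobRecord:
    """Cluster-neutral job record (hypothetical schema, not from the paper)."""
    cluster: str
    user: str
    group: str
    cores: int
    wall_seconds: float
    end_time: datetime


def from_htcondor(ad: dict) -> JobRecord:
    # Integration: map an HTCondor-style job record onto the common schema.
    return JobRecord(
        cluster="htcondor",
        user=ad["Owner"],
        group=ad["AcctGroup"],
        cores=int(ad.get("RequestCpus", 1)),
        wall_seconds=float(ad["RemoteWallClockTime"]),
        end_time=datetime.fromtimestamp(int(ad["CompletionDate"]), tz=timezone.utc),
    )


def from_slurm(row: dict) -> JobRecord:
    # Integration: map a Slurm-style accounting row onto the common schema.
    end = datetime.strptime(row["End"], "%Y-%m-%dT%H:%M:%S")
    return JobRecord(
        cluster="slurm",
        user=row["User"],
        group=row["Account"],
        cores=int(row["AllocCPUS"]),
        wall_seconds=float(row["ElapsedRaw"]),
        end_time=end.replace(tzinfo=timezone.utc),
    )


def monthly_core_hours(records):
    # Statistics: sum core-hours per (month, group, user).
    totals = defaultdict(float)
    for r in records:
        month = r.end_time.strftime("%Y-%m")
        totals[(month, r.group, r.user)] += r.cores * r.wall_seconds / 3600.0
    return dict(totals)


def render_invoice(totals):
    # Presentation: format the aggregated totals as a plain-text table.
    lines = ["month    group      user       core_hours"]
    for (month, group, user), hours in sorted(totals.items()):
        lines.append(f"{month}  {group:<10} {user:<10} {hours:10.2f}")
    return "\n".join(lines)


# Acquisition stand-in: made-up sample records instead of real cluster data.
raw_htcondor = [{"Owner": "alice", "AcctGroup": "physics", "RequestCpus": 1,
                 "RemoteWallClockTime": 7200, "CompletionDate": 1577880000}]
raw_slurm = [{"User": "alice", "Account": "physics", "AllocCPUS": 4,
              "ElapsedRaw": 3600, "End": "2020-01-01T12:00:00"}]

records = [from_htcondor(a) for a in raw_htcondor] + [from_slurm(r) for r in raw_slurm]
totals = monthly_core_hours(records)
print(render_invoice(totals))
```

Normalizing both sources into one record type before aggregating is what lets a single statistics and presentation path serve both clusters, which is the motivation the introduction gives for a unified system.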

Related Works

Design and Implementation

The Integration Layer

The Statistical Layer

The Presentation Layer

System Status

Conclusion

Journal: EPJ Web of Conferences	Publication Date: Jan 1, 2020
Citations: 1	License type: CC BY 4.0
