Abstract

Two production clusters coexist at the Institute of High Energy Physics (IHEP). One is a High Throughput Computing (HTC) cluster with HTCondor as the workload manager; the other is a High Performance Computing (HPC) cluster with Slurm as the workload manager. The resources of the HTCondor cluster are funded by multiple experiments, and resource utilization has reached more than 90% through a dynamic resource sharing mechanism. Nevertheless, a bottleneck arises when multiple experiments request more resources at the same moment. On the other hand, parallel jobs running on the Slurm cluster have some specific attributes, such as a high degree of parallelism, low quantity, and long wall time. Such attributes make it easy to generate free resource slots that are suitable for jobs from the HTCondor cluster. As a result, a mechanism that schedules jobs from the HTCondor cluster to the Slurm cluster transparently would improve the resource utilization of the Slurm cluster and reduce job queue time for the HTCondor cluster. In this contribution, we present three methods to migrate HTCondor jobs to the Slurm cluster and conclude that HTCondor-C is preferred. Furthermore, because the design philosophy and application scenarios of HTCondor and Slurm differ, some issues related to job scheduling and their possible solutions are presented.

Highlights

  • The resource utilization ratio of the HTCondor cluster has reached more than 90%, which means it has reached the bottleneck of resource provision for now

  • There are two local computing clusters at the Institute of High Energy Physics (IHEP): one is an HTCondor cluster, the other is a Slurm cluster

  • Most jobs running on the HTCondor cluster are single-core jobs, while parallel and multi-core jobs run on the Slurm cluster

Summary

Introduction

The resource utilization ratio of the HTCondor cluster has reached more than 90%, which means it has hit the bottleneck of resource provision for now. The workload of the Slurm cluster is comparatively light, with a resource utilization ratio of 50% on average. If HTCondor jobs could be migrated to and run on the Slurm cluster, users of the HTCondor cluster would have more resources to run their jobs, and the resource utilization ratio of the Slurm cluster would increase at the same time. To establish that workload integration between the HTCondor and Slurm clusters is feasible, section 2 lists and compares related and similar works.

Related works
Job migration
Overlap
Flocking
HTCondor-C
The issue of large job quantity
The issue of resource sharing
The issue of system environment
Findings
Conclusion