Abstract

A typical HEP computing center runs at least one batch system. At IHEP (Institute of High Energy Physics, Chinese Academy of Sciences), we have used three: PBS, HTCondor and SLURM. After running PBS as the local batch system for 10 years, we replaced it with HTCondor (for HTC) and SLURM (for HPC). During the transition, problems arose on both the user and the administrator side. Each new batch system requires users to learn system-specific knowledge, in particular its batch commands, and in some cases users have to use HTCondor and SLURM in parallel. Furthermore, HTCondor and SLURM provide more functionality than PBS, which makes them more complicated to use than the simple PBS commands. On the administrator side, HTCondor gives users more freedom, which poses an additional challenge. Site administrators have to solve several problems: preventing users from requesting resources they are not allowed to use, checking that the requested attributes are correct, and deciding where the requested resources are located (the SLURM cluster, the virtual machine cluster, remote sites, etc.). HepJob was designed and developed to meet these requirements. It provides a set of simple user commands, for example hep_sub, hep_q and hep_rm. During submission, HepJob checks that all job attributes are correct, assigns the proper resources to the user (user and group information is obtained from the management database), routes the job to the target site, and performs any other steps required. Users can get started with HepJob easily, and administrators can carry out the necessary management actions within HepJob.
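The submission steps described above (check attributes, confirm the user's group, route the job to a backend) can be illustrated with a minimal Python sketch. This is not HepJob's implementation: the `JobRequest` fields, the `USER_GROUPS` and `GROUP_BACKEND` tables, and the group names are all hypothetical stand-ins, with the in-memory dictionaries taking the place of the management database.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the management database: user -> allowed groups.
USER_GROUPS = {
    "alice": {"exp_a"},
    "bob": {"exp_a", "hpc_b"},
}

# Hypothetical routing table: group -> backend batch system.
GROUP_BACKEND = {
    "exp_a": "htcondor",
    "hpc_b": "slurm",
}


@dataclass
class JobRequest:
    user: str
    group: str
    script: str
    cpus: int = 1


def check_request(req):
    """Return a list of problems; an empty list means the request is acceptable."""
    problems = []
    if req.cpus < 1:
        problems.append("cpus must be at least 1")
    if req.group not in GROUP_BACKEND:
        problems.append(f"unknown group {req.group!r}")
    elif req.group not in USER_GROUPS.get(req.user, set()):
        problems.append(f"user {req.user!r} may not use group {req.group!r}")
    return problems


def route(req):
    """Validate the request, then return the name of the backend that should run it."""
    problems = check_request(req)
    if problems:
        raise ValueError("; ".join(problems))
    return GROUP_BACKEND[req.group]
```

In this sketch, `route(JobRequest(user="alice", group="exp_a", script="job.sh"))` selects the HTCondor backend, while a request from `alice` for the `hpc_b` group is rejected before anything reaches a batch system. Keeping the checks in one function that returns all problems at once lets a frontend report every issue in a single message.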

Highlights

  • By the end of 2019, IHEP computing resources were serving more than 2800 users from more than 13 experiments

  • In total three batch systems were provided to users, namely PBS[1], HTCondor[2] and SLURM[3]

  • A new HPC cluster managed by SLURM is being expanded


Summary

Background

By the end of 2019, IHEP computing resources were serving more than 2800 users from more than 13 experiments. In total, three batch systems were provided to users: PBS[1], HTCondor[2] and SLURM[3]. Many users had developed custom scripts for PBS, which had to be adapted for the new batch systems. Overall, this implied many changes for both users and experiments. With a frontend tool, the interface to the batch systems becomes transparent: users do not need to learn HTCondor and SLURM, since they only interact with the new frontend tool. The frontend tool can also perform checks before a job is submitted. This helps to prevent jobs from being interrupted due to issues with file permissions of the job script or user/group permissions for a given resource. The details of HepJob will be described
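A pre-submission check on the job script's file permissions, as mentioned above, could look roughly like the following Python sketch. The function name `script_problems` and the specific rules (the script must be a regular file that is readable and executable by the submitting user) are assumptions made for illustration; the source does not state which checks HepJob actually applies.

```python
import os


def script_problems(path):
    """Return a list of permission problems for a job script (hypothetical rules)."""
    if not os.path.isfile(path):
        # Nothing else can be checked if the script is missing or not a regular file.
        return [f"{path}: not a regular file"]
    problems = []
    if not os.access(path, os.R_OK):
        problems.append(f"{path}: not readable by the submitting user")
    if not os.access(path, os.X_OK):
        problems.append(f"{path}: not executable by the submitting user")
    return problems
```

Running such a check on the submit host catches the problem before the job is queued, rather than letting it fail later on a worker node.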

Overview of HepJob
Processing for HTCondor
Plugins for SLURM
Performance Tests and Current Status
Conclusion