Abstract

Scheduling multi-core workflows in a global HTCondor pool is a multi-dimensional problem whose solution depends on the requirements of the job payloads, the characteristics of the available resources, and boundary conditions, such as fair share and prioritization, imposed on the matching of jobs to resources. Within the context of a dedicated task force, CMS has significantly increased the scheduling efficiency of workflows in reusable multi-core pilots through improvements that address the limitations of the GlideinWMS pilots, the accuracy of resource requests, the efficiency and speed of the HTCondor infrastructure, and the job matching algorithms.

Highlights

  • The CMS Submission Infrastructure (SI) Group is responsible for GlideinWMS [1] and HTCondor [2] pool operations in the CMS experiment at CERN, as well as setting and communicating our priorities to the respective software development teams

  • The CMS Global Pool is at once a GlideinWMS instance and an HTCondor pool

  • CPU efficiency in the WLCG is defined as the measured CPU time over the wall clock time weighted by the logical CPU core count
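
The CPU efficiency definition in the last highlight can be written out as a short calculation. The sketch below is illustrative only: the function name, argument names, and example numbers are assumptions and do not come from the paper or from any WLCG tool. It reads "weighted by the logical CPU core count" as dividing CPU time by the product of wall clock time and core count.

```python
def cpu_efficiency(cpu_time_s: float, wall_time_s: float, cores: int) -> float:
    """Return CPU time divided by (wall clock time x logical core count).

    Hypothetical helper for illustration; all names are assumptions.
    """
    if wall_time_s <= 0 or cores <= 0:
        raise ValueError("wall_time_s and cores must be positive")
    return cpu_time_s / (wall_time_s * cores)


# Made-up example: an 8-core slot held for 10 hours of wall clock time
# (36,000 s) that accumulated 230,400 s of CPU time across all cores.
example = cpu_efficiency(cpu_time_s=230_400, wall_time_s=36_000, cores=8)
print(f"CPU efficiency: {example:.2f}")
```

Under this reading, cores that sit idle inside a running pilot lower the ratio just as much as a payload that underuses its allocated cores.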


Summary

Multi-core Pool Scheduling

The CMS Submission Infrastructure (SI) Group is responsible for GlideinWMS [1] and HTCondor [2] pool operations in the CMS experiment at CERN, as well as setting and communicating our priorities to the respective software development teams. In response to demand for resources from job schedulers (schedds), a GlideinWMS frontend queries the job queues on the schedds and sends requests to several GlideinWMS factories to submit multi-core pilot jobs (called “glideins”) to Grid and Cloud sites worldwide. These pilots instantiate HTCondor resources (a startd) that join one of the several HTCondor pools managed by the SI group. CMS schedulers flock jobs to multiple pools, the largest being the CMS Global Pool, which connects CMS Tier-1, Tier-2 and Tier-3 sites worldwide. This pool reaches scales of 200,000 CPU cores or more. Pilots can become more fragmented over time as lower core count jobs finish asynchronously, making the matching of higher core count jobs impossible, even if they are from higher priority workflows.
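
The fragmentation effect described above can be illustrated with a toy model. This is a minimal sketch, not the HTCondor negotiator or any GlideinWMS logic: it assumes only that a job must fit within the free cores of a single pilot, and the function names and core counts are hypothetical.

```python
from typing import List


def total_idle_cores(free_cores: List[int]) -> int:
    """Sum of idle cores across all pilots in the toy pool."""
    return sum(free_cores)


def can_match(free_cores: List[int], requested_cores: int) -> bool:
    """True if at least one pilot has enough free cores for the job.

    Assumption: a job cannot span pilots, so idle cores on different
    pilots cannot be combined to satisfy one request.
    """
    return any(free >= requested_cores for free in free_cores)


# Four hypothetical 8-core pilots, each left with 2 idle cores after
# some of their lower core count jobs finished at different times.
fragmented = [2, 2, 2, 2]

print(total_idle_cores(fragmented))  # idle cores in the pool
print(can_match(fragmented, 4))      # can a 4-core job start?
print(can_match(fragmented, 2))      # can a 2-core job start?
```

In this example the pool holds eight idle cores in total, yet a 4-core job cannot start on any pilot, regardless of its priority, while 2-core jobs still match.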

Scheduling Efficiency
Legitimate Use Cases
Pilot Improvements
HTCondor Improvements
Findings
Conclusions
