Abstract

The scheduler is one of the most important components of a high performance cluster. This paper introduces a self-adaptive dispatching system (SAPS) based on Torque[1] and Maui[2]. It promotes cluster resource utilization and improves the overall speed of tasks. It provides some extra functions for administrators and users. First of all, in order to allow the scheduling of GPUs, a GPU scheduling module based on Torque and Maui has been developed. Second, SAPS analyses the relationship between the number of queueing jobs and the idle job slots, and then tunes the priority of users’ jobs dynamically. This means more jobs run and fewer job slots are idle. Third, integrating with the monitoring function, SAPS excludes nodes in error states as detected by the monitor, and returns them to the cluster after the nodes have recovered. In addition, SAPS provides a series of function modules including a batch monitoring management module, a comprehensive scheduling accounting module and a real-time alarm module. The aim of SAPS is to enhance the reliability and stability of Torque and Maui. Currently, SAPS has been running stably on a local cluster at IHEP (Institute of High Energy Physics, Chinese Academy of Sciences), with more than 12,000 cpu cores and 50,000 jobs running each day. Monitoring has shown that resource utilization has been improved by more than 26%, and the management work for both administrator and users has been reduced greatly.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call