Abstract
This thesis deals with the problem of scheduling applications on High-Performance Computing (HPC) machines. The goal is to create a scheduler that improves on state-of-the-art solutions under different metrics. However, improving solution quality is not enough: a scheduler for future HPC machines must also take overheads and scalability into account. In this thesis we present a comprehensive, scalable scheduling approach that features both an off-line and an on-line component. The off-line component is based on Constraint Programming (CP), an optimization technique that is well suited to scheduling problems and allows for great flexibility. We leverage this flexibility to first present an optimization method designed to minimize job waiting times, which is then extended via heuristics and search strategies to deal with more complex objective functions. Unfortunately, a solver cannot handle such complex objective functions in an amount of time acceptable for on-line operation on an HPC machine in production. We deal with this difficulty by making use of a second, distributed, on-line scheduler. This second scheduler is designed to dramatically decrease the computational overhead and to achieve a scalability adequate for future ExaFlops HPC machines. The distributed scheduler is proactive: it takes decisions so as to follow a desirable, pre-specified utilization profile. This feature makes it possible to connect the two schedulers into a hybrid system: the CP component computes a schedule on a trace of forecasted jobs one day ahead, machine learning techniques extract a near-optimal, desirable utilization profile from the solution, and the on-line scheduler takes care of the actual scheduling decisions in a scalable fashion. The resulting architecture improves the HPC machine profit by 8.6% on average, while decreasing the computational overhead and, under normal conditions, without any side effects.