An Adaptive Middleware Framework for Optimal Scheduling on Large Scale Compute Clusters

Ian Gorton, Christopher Oehmen, Arzu Gosney, John H Miller

doi:10.1109/itng.2011.126

Abstract

In production multi-user high-performance computing (HPC) batch environments, wait times for scheduled jobs are highly dynamic. For scientific users, the primary measure of efficiency is wall-clock time-to-solution. In high-throughput applications, such as many kinds of biological analysis, the computational work can be scheduled flexibly, taking a longer time on a small number of processors or a shorter time on a large number of processors. The capability to choose a platform at run time based on both processing capabilities and availability (lowest wait time) would therefore be attractive. The goal of our work was to create an adaptive interface to HPC systems that dynamically reschedules high-throughput calculations in response to fluctuating load, optimizing for time-to-solution. This was done by implementing middleware functionality to (1) monitor the resource load on a given compute cluster, (2) generate a plan and check its applicability against the defined goals, and (3) adaptively choose the optimal job dimensions (number of processors and wall-clock time) to provide the best time-to-solution.
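The selection step in (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `JobShape`, `choose_shape`, and `toy_wait` are hypothetical, the work is assumed to divide evenly across processors (a simplification of the high-throughput case), and the queue-wait estimate is a made-up stand-in for whatever the load monitor in (1) would report.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class JobShape:
    """Hypothetical job dimensions: processor count and wall-clock request."""
    processors: int
    wall_hours: float


def choose_shape(
    work_cpu_hours: float,
    processor_options: Iterable[int],
    estimate_wait_hours: Callable[[JobShape], float],
) -> Tuple[JobShape, float]:
    """Return the shape with the lowest estimated time-to-solution.

    Time-to-solution is modelled as estimated queue wait plus run time.
    Assumption: run time = total work / processors (perfect scaling).
    """
    best_shape = None
    best_tts = float("inf")
    for procs in processor_options:
        run_hours = work_cpu_hours / procs
        shape = JobShape(processors=procs, wall_hours=run_hours)
        tts = estimate_wait_hours(shape) + run_hours
        if tts < best_tts:
            best_shape, best_tts = shape, tts
    if best_shape is None:
        raise ValueError("processor_options must not be empty")
    return best_shape, best_tts


def toy_wait(shape: JobShape) -> float:
    # Hypothetical load model: larger requests wait longer in the queue.
    return 0.01 * shape.processors


# 64 CPU-hours of work, evaluated over a few candidate processor counts.
best, tts = choose_shape(64.0, [8, 16, 32, 64, 128], toy_wait)
print(best.processors, round(tts, 2))
```

With this toy wait model the sketch selects 64 processors: doubling to 128 halves the run time but adds more queue wait than it saves. In an adaptive setting, the same selection would be repeated as the wait estimates change with cluster load.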

Full Text