Probabilistic Reservation Services for Large-Scale Batch-Scheduled Systems

Daniel Nurmi, Rich Wolski, John Brevik

doi:10.1109/jsyst.2008.2011303

Abstract

In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workload using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific, and often partially hidden, policy designed to maximize machine utilization while providing tolerable turnaround times. In practice, while most HPC systems experience good utilization levels, the amount of time individual jobs spend waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion and/or frustration. One proposed method for dealing with this uncertainty is to predict the amount of time that individual jobs will wait in batch queues once they are submitted, thus allowing a user to reason about the total time between job submission and job completion (which we term a job's “overall turnaround time”). Another related but distinct method for handling the uncertainty is to allow users who are willing to plan ahead to make “advanced reservations” for processor resources, again allowing them to reason about job turnaround time. To date, however, few if any HPC centers provide either job-queue delay prediction services or advanced reservation capabilities to their general user populations. In this paper, we describe QBETS, VARQ, and CO-VARQ, new methods that allow users to reason about and control the overall turnaround time of their batch-queue jobs submitted to busy HPC systems in existence today. QBETS is an online, non-parametric system for predicting statistical bounds on the amount of time individual batch jobs will wait in queue.
VARQ is a new method for job scheduling that provides users with probabilistic “virtual” advanced reservations using only existing best-effort batch schedulers and policies, and CO-VARQ utilizes this capability to implement a general coallocation service. QBETS, VARQ, and CO-VARQ operate as overlays, requiring no modification to the local scheduler implementation or policies. We describe the statistical methods we use to implement the systems, detail empirical evaluations of their effectiveness in a number of HPC settings, and explore the potential future impact of these systems should they become widely used.
