Abstract

We describe a method for queue wait time prediction in supercomputing clusters. It was designed for use as part of multi-criteria brokering mechanisms for resource selection in a multi-site High Performance Computing environment. The aim is to incorporate the time jobs stay queued in the scheduling system into the selection criteria. End users can also use the method to estimate the time to completion of their computing jobs. It uses historical data about the particular system to make predictions. It returns a list of probability estimates of the form (t_i, p_i), where p_i is the probability that the job will start before time t_i. The times t_i can be chosen more or less freely when deploying the system. Compared to regression methods, which return only a single number as a queue wait time estimate (usually without error bars), our prediction system provides more useful information. The probability estimates are calculated using Bayes' theorem with the naive assumption that the attributes describing the jobs are independent. They are then calibrated to make them as accurate as possible given the available data. We describe our service, its REST API and the underlying methods in detail, and provide empirical evidence in support of the method's efficacy. This article is part of the theme issue ‘Multiscale modelling, simulation and computing: from the desktop to the exascale’.
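The naive Bayes step outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `train` and `predict`, the use of categorical job attributes, the Laplace smoothing parameter `alpha`, and the choice to fit one independent binary model per threshold are all assumptions made for illustration. The calibration step mentioned in the abstract is omitted, so the returned probabilities are raw naive Bayes estimates and are not guaranteed to be non-decreasing across thresholds.

```python
from collections import Counter, defaultdict


def train(jobs, thresholds):
    """Count historical jobs per threshold.

    jobs: list of (attrs, wait) pairs, where attrs is a dict of categorical
    job attributes (hypothetical, e.g. {"queue": "short"}) and wait is the
    observed queue wait time in seconds.
    """
    vocab = defaultdict(set)  # attribute name -> set of observed values
    for attrs, _ in jobs:
        for name, value in attrs.items():
            vocab[name].add(value)

    per_threshold = {}
    for t in thresholds:
        class_counts = Counter()             # label -> number of jobs
        value_counts = defaultdict(Counter)  # (name, label) -> value counts
        for attrs, wait in jobs:
            label = wait < t                 # True: job started before t
            class_counts[label] += 1
            for name, value in attrs.items():
                value_counts[(name, label)][value] += 1
        per_threshold[t] = (class_counts, value_counts)
    return {"vocab": dict(vocab), "per_threshold": per_threshold}


def predict(model, attrs, alpha=1.0):
    """Return a list of (threshold, probability) pairs for one job."""
    estimates = []
    for t in sorted(model["per_threshold"]):
        class_counts, value_counts = model["per_threshold"][t]
        total = sum(class_counts.values())
        scores = {}
        for label in (True, False):
            # Smoothed prior P(label).
            score = (class_counts[label] + alpha) / (total + 2 * alpha)
            # Naive assumption: attributes are independent given the label.
            for name, value in attrs.items():
                n_values = len(model["vocab"].get(name, ())) or 1
                seen = value_counts[(name, label)][value]
                score *= (seen + alpha) / (class_counts[label] + alpha * n_values)
            scores[label] = score
        # Bayes' theorem: normalise over the two possible labels.
        estimates.append((t, scores[True] / (scores[True] + scores[False])))
    return estimates
```

Calling `predict(train(history, [600, 3600]), {"queue": "short"})` on a list `history` of past jobs would return one probability for each of the two thresholds, in increasing threshold order.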

Highlights

  • The issue of queue wait times comes up in many situations in High Performance Computing (HPC)

  • We describe a method for queue wait time prediction in supercomputing clusters

  • Underused systems are unlikely to be interesting in terms of queue wait time predictions


Summary

Introduction

The issue of queue wait times comes up in many situations in High Performance Computing (HPC). The multi-criteria approach to resource selection, together with the practical need for queue wait time prediction when estimating the time to completion, was the initial inspiration and the main motivation for the work described in this paper. (i) The Pattern Optimization Service, based on knowledge of the application itself and static information about the infrastructure, generates a list of assignment plans that determines the search space for the optimal allocation of resources to computational kernels (parts of the multi-scale application). This list is submitted to the QCG-Broker service as part of the job description, along with the requirements for computing resources and the definition of optimization criteria and limits. Gao et al. [13] use a genetic algorithm to optimize job scheduling; their approach does not apply here because we need to estimate job queue wait time probabilities.

Naive Bayes for queue wait time prediction
Results
Conclusion