PREP: Predicting Job Runtime with Job Running Path on Supercomputers

Longfang Zhou, Xiaorong Zhang, Yadong Wu, Wenxiang Yang, Yongguo Han, Fang Wang, Jie Yu

doi:10.1145/3472456.3473521

Abstract

Supercomputers serve many parallel jobs by scheduling them and allocating computing resources. One popular scheduling strategy is First Come First Serve (FCFS). Under FCFS, however, some resources always sit idle: they are too few to run the head job in the waiting queue, yet they are reserved for it. To improve resource utilization, a common solution is backfilling, which allocates the reserved computing resources to a small, short job selected from the queue, provided that doing so does not delay the original head job. Backfilling depends on knowing how long each job will run, but the runtime estimates provided by users are often too high. Previous studies extract features from historical job logs and predict runtime with machine learning. However, traditional features (e.g., CPU, user, submission time) are insufficient to describe the characteristics of jobs. In this paper, we propose a novel runtime prediction framework called PREP. It explores a new feature named the job running path, which encodes important information about the job’s characteristics, such as the project it belongs to and the data sets and parameters it uses. Because there is a strong correlation between a job's runtime and its running path, PREP groups jobs into separate clusters according to their running paths and trains a runtime prediction model for each job cluster. Experimental results demonstrate that adding the new feature achieves a high prediction accuracy of 88% and outperforms other methods, such as Last-2 and IRPA.
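The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the PREP implementation: it assumes that jobs with an identical running path form one cluster, and it uses the mean historical runtime as a stand-in for the per-cluster prediction model. The function names, the example paths, and the fallback to a global mean for unseen paths are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cluster_by_path(jobs):
    """Group (running_path, runtime_seconds) records by running path.

    Assumption: an exact path match defines a cluster.
    """
    clusters = defaultdict(list)
    for path, runtime in jobs:
        clusters[path].append(runtime)
    return dict(clusters)


def train_per_cluster(clusters):
    """Build one stand-in 'model' per cluster: the mean runtime of its jobs."""
    return {path: mean(runtimes) for path, runtimes in clusters.items()}


def predict(models, path, fallback):
    """Predict with the cluster's model, or use the fallback for unseen paths."""
    return models.get(path, fallback)


# Hypothetical job history: (running path, observed runtime in seconds).
history = [
    ("/home/alice/projA/run1", 100.0),
    ("/home/alice/projA/run1", 120.0),
    ("/home/bob/projB/big", 3600.0),
]

models = train_per_cluster(cluster_by_path(history))
fallback = mean(runtime for _, runtime in history)  # global mean, an assumption
```

Calling `predict(models, "/home/alice/projA/run1", fallback)` returns the mean of that cluster's two runtimes, while a path with no history falls back to the global mean. A learned regressor per cluster would replace `train_per_cluster` in a more faithful version.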

Full Text