Predicting running time of aerodynamic jobs in HPC system by combining supervised and unsupervised learning method

Hao Wang, Yong Dong, Yi-Qin Dai, Jie Yu

doi:10.1186/s42774-021-00077-8

Abstract

Improving resource utilization is an important goal for the high-performance computing systems of supercomputing centers. To meet this goal, the job scheduler of a high-performance computing system often uses backfilling scheduling to fill short-time jobs into the job gaps at the front of the queue. Backfilling scheduling needs to know the running time of each job. In the past, the job running time was usually given by users and often far exceeded the actual running time, which led to inaccurate backfilling and a waste of computing resources. In particular, when the predicted job running time is lower than the actual time, the damage to the utilization of the system’s computing resources is even more serious. The prediction accuracy of the job running time is therefore crucial to the utilization of system resources. Machine learning methods can predict the job running time more accurately. Targeting parallel aerodynamics applications, we propose SU, a job running time prediction framework that combines supervised and unsupervised learning, and verify it on real historical data from the high-performance computing systems of the China Aerodynamics Research and Development Center (CARDC). The experimental results show that SU has a high prediction accuracy (80.46%) and a low underestimation rate (24.85%).

Highlights

High-performance computing [1] has been widely used in the fields of science and engineering.
To improve the utilization of computing resources, the job scheduler usually adopts a backfilling strategy, which schedules short-time jobs at the back of the job queue in advance if these jobs do not delay the execution of the first job in the queue. In this process, the job scheduler captures the following job information: the submission time of the job (Submit_time), the number of CPU cores required for the job to complete (CPU_req), the job name (Job_name), the user name (User), the user ID (User_id), the job waiting time (Wait_time), the estimated time for the job (Time_req), etc.
We propose SU, a job running time prediction framework that combines the advantages of supervised and unsupervised learning methods and lays a foundation for job running time prediction.

Summary

Introduction

High-performance computing [1] has been widely used in the fields of science and engineering. To improve the utilization of computing resources, the job scheduler usually adopts a backfilling strategy, which schedules short-time jobs at the back of the job queue in advance if these jobs do not delay the execution of the first job in the queue. In this process, the job scheduler captures the following job information: the submission time of the job (Submit_time), the number of CPU cores required for the job to complete (CPU_req), the job name (Job_name), the user name (User), the user ID (User_id), the job waiting time (Wait_time), the estimated time for the job (Time_req), etc. The estimated time is usually given by the user and often far exceeds the actual running time of the job. For the system, relying on this estimate is not beneficial and can waste a large amount of computing resources.
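
The backfilling check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler used at CARDC: the names Job, can_backfill, free_cores and head_start_time are hypothetical, times are in hours, and the rule is simplified to "the candidate fits in the idle cores and is estimated to finish before the first job in the queue can start".

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str        # Job_name
    cpu_req: int     # CPU_req: number of CPU cores requested
    time_req: float  # Time_req: estimated running time in hours


def can_backfill(candidate: Job, free_cores: int,
                 head_start_time: float, now: float = 0.0) -> bool:
    """Return True if the candidate may start now without delaying the head job.

    Simplified rule (an assumption of this sketch): the candidate must fit in
    the currently idle cores and its estimated finish time must not exceed the
    time at which the first job in the queue is expected to start.
    """
    fits_in_idle_cores = candidate.cpu_req <= free_cores
    finishes_before_head = now + candidate.time_req <= head_start_time
    return fits_in_idle_cores and finishes_before_head


# Hypothetical situation: 16 cores are idle and the head job can start in 2 hours.
FREE_CORES = 16
HEAD_START = 2.0

# The same job, which really runs for 1 hour, with two different estimates.
overestimated = Job("cfd_case", cpu_req=8, time_req=6.0)  # user-style estimate
accurate = Job("cfd_case", cpu_req=8, time_req=1.0)       # estimate close to reality

print(can_backfill(overestimated, FREE_CORES, HEAD_START))
print(can_backfill(accurate, FREE_CORES, HEAD_START))
```

With the overestimated Time_req the job is rejected and the idle cores stay unused, while the accurate estimate lets it run in the gap. The opposite error is also visible in this rule: a job whose estimate is below its actual running time passes the check but is still running when the head job should start, which is why the underestimation rate matters alongside accuracy.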

Methods

Results

Conclusion