Abstract

Job scheduling is a long-studied topic in High-Performance Computing (HPC), and it is increasingly studied in data centers. Large data centers are often split into separate partitions for cloud computing and HPC, each normally with its own scheduler. The possibility of migrating jobs from the HPC partition to the cloud one is widely discussed in the literature. Job migration from the cloud to HPC, however, is a much less explored topic. Nevertheless, such migration may be useful in many situations, in particular when the HPC platform has a low resource usage level and the cloud usage level is high. A large number of jobs that could migrate from the cloud to the HPC partition can be observed in Google data center workloads. Job scheduling with an overbooking strategy is seen as the main reason for the high resource usage level in clouds. However, overbooking can lead to a high rate of rescheduling and job dumping, which potentially causes response time violations. This work shows that HPC platforms can host and execute some cloud jobs with low interference with HPC jobs and a low number of response time violations. We introduce the definition of a cloud-HPC convergence area and propose a job scheduling strategy for it, aiming to reduce the number of response time violations of cloud jobs without interfering with the execution of HPC jobs. Our proposal is formally defined and then evaluated in different execution scenarios, using the SimGrid simulation framework, with workload data from a production HPC grid. The experimental results show that there are often many empty areas in the scheduling plan of HPC platforms, which makes it possible to allocate cloud jobs by backfilling. This is due to the sparse HPC job submission pattern and the low resource usage level of some HPC platforms. One simulation scenario considered a set of 11K parallel HPC jobs running on a 2560-processor platform with an average resource usage level of 38.0%. The proposed convergence scheduler succeeded in injecting around 267K cloud jobs into the HPC platform, with a response time violation rate under 0.00094% for those jobs, considering 80 processors in the convergence area and no effect on the HPC workload. This experiment considered cloud jobs modeled on the job features of Google public cloud workloads, with a processing time slack factor of 1.25 (which corresponds to high priority in the Google cloud SLA, Service Level Agreement). Most cloud jobs show a slack factor higher than 1.25 (most are medium or low priority). The same simulation, repeated with a higher slack factor (4), showed no response time violations.
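To illustrate the idea of backfilling cloud jobs into idle gaps of an HPC scheduling plan under a slack-factor deadline, the sketch below is a minimal, hypothetical example; it is not the authors' implementation, and all names (CloudJob, Gap, try_backfill) are assumptions introduced here for illustration.

```python
# Minimal sketch: place a cloud job in an idle gap of the HPC plan so that it
# finishes before its response-time deadline (submit_time + slack * processing_time)
# and does not touch the reservations of HPC jobs. Hypothetical data structures.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CloudJob:
    job_id: int
    submit_time: float      # arrival time of the cloud job
    processing_time: float  # expected run time
    procs: int              # processors requested
    slack: float = 1.25     # deadline = submit_time + slack * processing_time

@dataclass
class Gap:
    start: float  # beginning of an idle interval in the convergence area
    end: float    # end of the idle interval (next HPC reservation starts here)
    procs: int    # processors free during the interval

def try_backfill(job: CloudJob, gaps: List[Gap]) -> Optional[float]:
    """Return a start time for the cloud job, or None if no gap fits before its deadline."""
    deadline = job.submit_time + job.slack * job.processing_time
    for gap in sorted(gaps, key=lambda g: g.start):
        start = max(gap.start, job.submit_time)
        finish = start + job.processing_time
        # The job must fit entirely inside the gap (so HPC jobs are unaffected)
        # and complete before its response-time deadline.
        if gap.procs >= job.procs and finish <= gap.end and finish <= deadline:
            return start
    return None  # no feasible placement: rejection or potential violation
```

Under this simplified model, a larger slack factor (e.g., 4 instead of 1.25) widens the deadline window, which is consistent with the abstract's observation that higher-slack workloads incur fewer response time violations.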
