GARLSched: Generative adversarial deep reinforcement learning task scheduling optimization for large-scale high performance computing systems

Jingbo Li, Xingjun Zhang, Jia Wei, Zeyu Ji, Zheng Wei

doi:10.1016/j.future.2022.04.032

Abstract

Efficient task scheduling has become increasingly complex in large-scale distributed high-performance computing (HPC) systems as the number and types of tasks proliferate and computing resources grow. Deep reinforcement learning (DRL) methods have achieved some success on scheduling problems. However, because tasks arrive exogenously and rewards are sparse, learning a DRL control policy requires a significant amount of training time and data, and effective convergence cannot be guaranteed. Meanwhile, drawing on their understanding of HPC system characteristics, experts have developed various heuristic scheduling policies with acceptable performance for different optimization goals, but these heuristics cannot adapt to environmental changes or optimize for specific workloads. Therefore, the generative adversarial reinforcement learning scheduling (GARLSched) algorithm is proposed, which uses the optimal policy in an expert pool to guide DRL learning in large-scale dynamic scheduling problems. In addition, a task embedding-based discriminator network improves and stabilizes the learning process. Experiments show that, compared with heuristic and DRL scheduling algorithms, GARLSched can learn high-quality scheduling policies for various workloads and optimization objectives. Furthermore, the learned models perform stably even when applied to unseen workloads, making them more practical in HPC systems.
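The abstract does not give the discriminator's loss, the task embedding, or the underlying DRL algorithm. As a rough illustration of the general idea only, the sketch below shows how a discriminator trained to tell expert scheduling decisions from agent decisions can supply a dense surrogate reward. It uses the standard generative adversarial imitation learning form, -log(1 - D(x)), and swaps in a tiny logistic model for the discriminator network. The two features (normalized requested runtime and normalized wait time of the selected job), the toy samples, and all function and class names are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticDiscriminator:
    """Hypothetical stand-in for a discriminator network.

    Outputs a value near 1 when a decision looks like the expert's.
    """

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def prob_expert(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, label, lr=0.1):
        # One SGD step on binary cross-entropy (label 1 = expert, 0 = agent).
        err = self.prob_expert(x) - label
        self.w = [wi - lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * err


def train_discriminator(disc, expert_samples, agent_samples, epochs=200, seed=0):
    rng = random.Random(seed)
    data = [(x, 1.0) for x in expert_samples] + [(x, 0.0) for x in agent_samples]
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            disc.update(x, y)
    return disc


def surrogate_reward(disc, x, eps=1e-8):
    # GAIL-style reward: grows as the agent's decision becomes harder
    # to distinguish from the expert's.
    return -math.log(1.0 - disc.prob_expert(x) + eps)


# Toy data (assumed): each sample is
# [normalized requested runtime, normalized wait time] of the chosen job.
# The "expert" here behaves like shortest-job-first and picks short jobs.
expert_samples = [[0.10, 0.5], [0.20, 0.3], [0.15, 0.8]]
agent_samples = [[0.90, 0.5], [0.80, 0.2], [0.70, 0.9]]

disc = train_discriminator(LogisticDiscriminator(dim=2), expert_samples, agent_samples)
reward_expert_like = surrogate_reward(disc, [0.10, 0.5])
reward_agent_like = surrogate_reward(disc, [0.90, 0.5])
```

In a full adversarial loop, the policy would be updated with a DRL algorithm against this reward, and the discriminator retrained on fresh agent decisions, in alternation. Because every decision receives a reward, this kind of signal is one plausible way to offset the reward sparsity the abstract mentions.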

Journal: Future Generation Computer Systems	Publication Date: May 12, 2022
Citations: 9