DAG-based workflows scheduling using Actor–Critic Deep Reinforcement Learning

Guilherme Piêgas Koslovski, Kleiton Pereira, Paulo Roberto Albuquerque

doi:10.1016/j.future.2023.09.018

Abstract

High-Performance Computing (HPC) is essential to advances in many research and industrial fields. Despite recent growth in processing and networking power, HPC Data Centers (DCs) have finite resources and must be carefully managed to host multiple jobs. Scheduling the tasks that compose a job is a crucial and complex problem, since the effects of the scheduler’s decisions are perceptible both to users (e.g., slowdown) and to infrastructure administrators (e.g., resource usage and queue length). The process of scheduling workflows atop a DC can be modeled as a graph mapping problem: an undirected graph represents the DC, and a Directed Acyclic Graph (DAG) expresses the task dependencies. Each vertex and edge in both graphs can carry weights, denoting the residual capacities of DC resources and the computing and networking demands of workflows. Motivated by the combinatorial explosion of this scheduling problem, Machine Learning (ML) is already being used to generate or improve scheduling policies. However, most proposals in the specialized literature either use simplified models to reduce the search space or are trained for specific scenarios, which leads to policies that eventually fall short of the expectations of real DCs. Given this challenge, this work applies Actor–Critic (AC) Reinforcement Learning (RL) to schedule DAG-based workflows. Instead of proposing a new policy, the AC RL selects the appropriate scheduling policy from a pool of consolidated algorithms, guided by the DAG workload and DC usage. The AC RL-based scheduler analyzes the DAG queue and the DC status to determine which algorithms are best suited to improve the overall performance indicators in each scenario instance.
The simulation protocol comprises multiple analyses with distinct workload configurations, numbers of jobs, queue ordering policies, and strategies for selecting the target DC servers. The results demonstrate that the AC RL selects the scheduling policy that fits the current workload and DC status.
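
To make the idea of selecting a policy from a pool more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It uses a tabular one-step actor-critic in place of the deep networks the paper refers to. The policy names, the two coarse states, the reward function, and the random state transitions are all hypothetical placeholders, since the abstract does not specify them.

```python
import math
import random

# Hypothetical pool of scheduling policies; the abstract does not name them.
POLICIES = ["fifo", "shortest_job_first"]
# Hypothetical coarse observations of the DAG queue and DC status.
STATES = ["short_queue", "long_queue"]


def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(state, action):
    """Made-up reward: state i is best served by policy i."""
    return 1.0 if state == action else 0.0


def train(steps=5000, alpha_actor=0.1, alpha_critic=0.1, gamma=0.5, seed=0):
    rng = random.Random(seed)
    # Actor: one preference per (state, policy). Critic: one value per state.
    prefs = [[0.0] * len(POLICIES) for _ in STATES]
    values = [0.0] * len(STATES)
    state = rng.randrange(len(STATES))
    for _ in range(steps):
        probs = softmax(prefs[state])
        action = rng.choices(range(len(POLICIES)), weights=probs)[0]
        reward = toy_reward(state, action)
        next_state = rng.randrange(len(STATES))  # toy dynamics, assumed random
        # TD error: how much better this step was than the critic expected.
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += alpha_critic * td_error
        # Policy-gradient step for a softmax actor, weighted by the TD error.
        for a in range(len(POLICIES)):
            indicator = 1.0 if a == action else 0.0
            prefs[state][a] += alpha_actor * td_error * (indicator - probs[a])
        state = next_state
    return prefs, values


prefs, values = train()
for s, name in enumerate(STATES):
    probs = softmax(prefs[s])
    best = POLICIES[probs.index(max(probs))]
    print(f"{name}: prefers {best}")
```

In this toy setting the actor learns, per state, which policy from the pool earns more reward, while the critic supplies the baseline that scales each update. A realistic scheduler would replace the two discrete states with features of the DAG queue and DC status, and the reward with performance indicators such as slowdown or queue length.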

Journal: Future Generation Computer Systems
Publication Date: Sep 15, 2023
Citations: 5
