Abstract
In the past few years, with the rapid development of CPU-GPU heterogeneous computing, the issue of task scheduling in the heterogeneous cluster has attracted a great deal of attention. This problem becomes more challenging with the need for efficient co-execution of tasks on the GPUs. However, the uncertainty of heterogeneous cluster and the interference caused by resource contention among co-executing tasks can lead to the unbalanced use of computing resource and further cause the degradation in performance of computing platform. In this article, we propose a two-stage task scheduling approach for streaming applications based on deep reinforcement learning and neural collaborative filtering, which considers fine-grained task division and task interference on the GPU. Specifically, the Learning-Driven Workload Parallelization (LDWP) method selects an appropriate execution node for the mutually independent tasks. By using the deep Q-network, the cluster-level scheduling model is online learned to perform the current optimal scheduling actions according to the runtime status of cluster environments and characteristics of tasks. The Interference-Aware Workload Parallelization (IAWP) method assigns subtasks with dependencies to the appropriate computing units, taking into account the interference of subtasks on the GPU by using neural collaborative filtering. For making the learning of neural network more efficient, we use pre-training in the two-stage scheduler. Besides, we use transfer learning technology to efficiently rebuild task scheduling model referring to the existing model. We evaluate our learning-driven and interference-aware task scheduling approach on a prototype platform with other widely used methods. The experimental results show that the proposed strategy can averagely improve the throughout for distributed computing system by 26.9 percent and improve the GPU resource utilization by around 14.7 percent.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: IEEE Transactions on Parallel and Distributed Systems
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.