Abstract

With the widespread use of GPU hardware, more and more distributed machine learning applications have begun to use CPU-GPU hybrid cluster resources to improve algorithm efficiency. However, existing distributed machine learning scheduling frameworks either consider task scheduling only on CPU resources or only on GPU resources; even when they account for the differences between CPU and GPU resources, they struggle to improve the resource utilization of the whole system. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to schedule the tasks within a job efficiently. In the full paper, we propose a CPU-GPU hybrid cluster scheduling framework in detail. First, according to the different characteristics of CPU and GPU computing power, the data is divided into data shards of different sizes to match the CPU and GPU computing resources. Second, the paper introduces a task scheduling method for the CPU-GPU hybrid cluster. Finally, the proposed method is verified at the end of the paper. In our evaluation on K-Means, the CPU-GPU hybrid computing framework improves the performance of K-Means by about 1.5 times, and the performance of K-Means improves significantly as the number of GPUs increases.
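To make the data-sharding step above concrete, the sketch below splits a dataset into shards whose sizes are proportional to an assumed per-task processing rate for each device type. The rates, task counts, and the function name partition_by_throughput are hypothetical illustrations of the general idea, not the paper's actual implementation.

# Minimal sketch: throughput-proportional data sharding for a CPU-GPU cluster.
# All rates and task counts below are assumed figures for illustration.
import numpy as np

def partition_by_throughput(data, cpu_rate, gpu_rate, n_cpu_tasks, n_gpu_tasks):
    """Split `data` into shards sized in proportion to the processing rate
    (e.g. samples/second) of the device type that will consume each shard."""
    total_rate = cpu_rate * n_cpu_tasks + gpu_rate * n_gpu_tasks
    n = len(data)
    cpu_shard = int(n * cpu_rate / total_rate)   # rows per CPU task
    gpu_shard = int(n * gpu_rate / total_rate)   # rows per GPU task
    shards, start = [], 0
    for _ in range(n_cpu_tasks):
        shards.append(("cpu", data[start:start + cpu_shard]))
        start += cpu_shard
    for _ in range(n_gpu_tasks):
        shards.append(("gpu", data[start:start + gpu_shard]))
        start += gpu_shard
    if start < n:                                # remainder from integer rounding
        dev, tail = shards[-1]
        shards[-1] = (dev, np.concatenate([tail, data[start:]]))
    return shards

# Example: each GPU task assumed ~8x faster than a CPU task (hypothetical figure).
data = np.random.rand(100_000, 16)
shards = partition_by_throughput(data, cpu_rate=1.0, gpu_rate=8.0,
                                 n_cpu_tasks=4, n_gpu_tasks=2)

Under this split the larger shards go to the GPU tasks, so CPU and GPU tasks finish their portions in roughly the same time.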

Highlights

  • With the widespread use of GPU hardware, more and more distributed machine learning applications have begun to use CPU-GPU hybrid cluster resources to improve algorithm efficiency

  • We propose a CPU-GPU hybrid cluster scheduling framework in detail

  • According to the different characteristics of CPU and GPU computing power, the data is divided into data shards of different sizes to match the CPU and GPU computing resources



Journal of Northwestern Polytechnical University (西北工业大学学报), https://doi.org/10.1051/jnwpu/20213930529

1.2 Resource requirements of GPU tasks and CPU tasks

A CPU-GPU heterogeneous cluster requires that a task's binary program both … The data sharding strategy serves two purposes: (1) for distributed machine learning, sharding the data lets multiple gradient-computation processes run in parallel, which shortens training time so the model can be trained quickly; (2) for the CPU-GPU heterogeneous cluster environment, the purpose of sharding is to make full use of both CPU resources and GPU resources. After partitioning, there are x CPU tasks and y GPU tasks. Among the n …
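As an illustration of the two purposes above, the sketch below runs one K-Means update step in parallel over the x CPU shards and y GPU shards produced earlier, then merges the partial results on the driver. Treating every shard as a CPU process via Python's multiprocessing, and the helper names partial_kmeans_step and kmeans_iteration, are simplifying assumptions for illustration, not the paper's scheduler.

# Minimal sketch: one parallel K-Means iteration over the shards, followed by
# aggregation of the per-shard partial sums on the driver.
import numpy as np
from multiprocessing import Pool

def partial_kmeans_step(args):
    shard, centroids = args
    # Assign every point in the shard to its nearest centroid.
    dists = np.linalg.norm(shard[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = shard[mask].sum(axis=0)    # partial sum for cluster j
        counts[j] = mask.sum()               # partial count for cluster j
    return sums, counts

def kmeans_iteration(shards, centroids):
    # `shards` would come from the CPU/GPU-aware partitioner; here every shard
    # is handled by a CPU worker purely to illustrate the aggregation step.
    with Pool(len(shards)) as pool:
        partials = pool.map(partial_kmeans_step,
                            [(shard, centroids) for _, shard in shards])
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    counts[counts == 0] = 1                  # avoid division by zero for empty clusters
    return sums / counts[:, None]            # updated centroids

# Usage (continuing the sharding example above, hypothetical initial centroids):
#   centroids = data[np.random.choice(len(data), 8, replace=False)]
#   centroids = kmeans_iteration(shards, centroids)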

