AEML: An Acceleration Engine for Multi-GPU Load-balancing in Distributed Heterogeneous Environment

Zhuo Tang, Li Yang, Xuedong Zhang, Lifan Du, Kenli Li

doi:10.1109/tc.2021.3084407

Abstract

To meet the rapidly growing computation requirements of big data and artificial intelligence, CPU-GPU heterogeneous clusters provide more computing capacity than CPU clusters. The number of GPUs on a single computing node is scalable, which greatly improves the computing capacity of a cluster whose size is limited. However, an effective load-balancing scheduling model for multi-GPU hardware environments is lacking. This paper proposes AEML, an acceleration engine for multi-GPU load balancing in a distributed heterogeneous environment. AEML can effectively integrate GPUs into a distributed processing framework and achieve good load balance among multiple heterogeneous GPUs. We propose a heterogeneous task execution model based on multiple GPUs and multiple streams (MGMS), which can effectively balance the workload across multiple GPUs. The MGMS model uses four core techniques: a fine-grained task mapping mechanism, a unified device resource management scheme, a novel resource-aware GPU task scheduling strategy, and a feedback-based stream adjustment scheme. AEML is implemented on Spark 2.4.1 and NVIDIA CUDA 10.0. We comprehensively evaluate the performance of AEML on multiple typical benchmarks. Experimental results show that AEML can fully exploit the computing power of GPUs and achieve good load balance among multiple heterogeneous GPUs.

Full Text