Abstract

Recently, pipeline parallelism has been developed for large-scale Deep Neural Network (DNN) training: it partitions the DNN model across multiple devices (e.g., GPUs) and improves training efficiency by processing minibatches of data through the resulting stages as a pipeline. However, existing model partitioning algorithms are mostly designed for homogeneous clusters with identical GPU devices and network connections (e.g., bandwidths), whereas heterogeneous GPU clusters are widely used in mainstream computing infrastructures. In heterogeneous environments, devices are equipped with different GPUs and network connections, and previous approaches perform poorly because the load across pipeline stages becomes unbalanced. In this paper, we propose PipePar, a model partitioning and task placement algorithm for pipeline-parallel DNN training in heterogeneous GPU clusters. PipePar is based on dynamic programming with search-space pruning and takes into account both GPU heterogeneity and network bandwidth. PipePar profiles the DNN model on each type of GPU and performs model partitioning and task placement for the given GPUs and network connections, balancing the load across pipeline stages in heterogeneous environments and thus improving training efficiency. We design and implement a pipeline-based distributed deep-learning training system on a heterogeneous GPU cluster and show through extensive experiments that PipePar outperforms the baseline approaches in the speed of large-scale DNN training.
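The abstract does not specify PipePar's exact recurrence, but the general idea of a dynamic program that assigns contiguous layer ranges to an ordered sequence of heterogeneous devices while minimizing the bottleneck stage time can be sketched as follows. This is a minimal illustration under assumed inputs, not the paper's algorithm: `layer_time` (per-layer times profiled per GPU type), `act_size` (activation sizes between layers), `devices` (GPU types in pipeline order), and `links` (inter-device bandwidths) are all hypothetical names, and the only pruning shown is the trivial keep-the-best-candidate step rather than the paper's search-space pruning.

```python
# Hedged sketch of bottleneck-minimizing pipeline partitioning over
# heterogeneous GPUs. All data structures and names are illustrative
# assumptions, not PipePar's actual interface.
import math
from functools import lru_cache

def partition(layer_time, act_size, devices, links):
    """
    layer_time[g][l] : profiled forward+backward time of layer l on GPU type g.
    act_size[l]      : activation size (MB) sent from layer l to layer l+1.
    devices          : ordered list of GPU types, one per pipeline stage.
    links[k]         : bandwidth (MB/s) between device k-1 and device k (k >= 1).
    Returns (bottleneck_time, cut_points), where cut_points[k] is the last
    layer index assigned to device k. Every device hosts at least one layer.
    """
    L = len(next(iter(layer_time.values())))   # number of layers
    K = len(devices)                            # number of pipeline stages

    def stage_cost(lo, hi, k):
        # Compute time of layers lo..hi (inclusive) on device k, plus the cost
        # of receiving layer lo-1's activation over the incoming link.
        t = sum(layer_time[devices[k]][lo:hi + 1])
        if k > 0 and lo > 0:
            t += act_size[lo - 1] / links[k]
        return t

    # dp(j, k): minimal bottleneck when layers 0..j are placed on devices 0..k,
    # with device k hosting the last stage.
    @lru_cache(maxsize=None)
    def dp(j, k):
        if k == 0:
            return stage_cost(0, j, 0), (j,)
        best_time, best_cuts = math.inf, ()
        for i in range(k - 1, j):               # device k gets layers i+1..j
            prev_time, prev_cuts = dp(i, k - 1)
            cand = max(prev_time, stage_cost(i + 1, j, k))
            if cand < best_time:                # keep only improving candidates
                best_time, best_cuts = cand, prev_cuts + (j,)
        return best_time, best_cuts

    return dp(L - 1, K - 1)
```

A small usage example under the same assumptions: two GPU types with different per-layer speeds and one slow link.

```python
layer_time = {"V100": [2, 2, 2, 2], "K80": [5, 5, 5, 5]}  # ms per layer
act_size = [10, 10, 10, 10]                               # MB between layers
devices = ["V100", "K80"]
links = [None, 100.0]                                     # MB/s into device 1
print(partition(layer_time, act_size, devices, links))
# -> assigns more layers to the faster V100 so the two stage times are balanced
```

Minimizing the maximum stage time (rather than the total time) reflects that pipeline throughput is limited by the slowest stage, which is the load-balancing objective the abstract emphasizes for heterogeneous clusters.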
