Abstract

The Deep Learning (DL) training process consists of multiple phases — data augmentation, training, and validation of the trained model. Traditionally, these phases are executed on CPUs or GPUs in a serial fashion due to the lack of additional computing resources to offload independent phases of DL training. Recently, Mellanox/NVIDIA introduced the BlueField-2 DPUs, which combine the advanced capabilities of traditional ASIC-based network adapters with an array of ARM processors. In this paper, we characterize and explore how one can take advantage of the additional ARM cores on the BlueField-2 DPUs to intelligently accelerate different phases of DL training. We propose multiple novel designs to efficiently offload phases of DL training to the DPUs. We evaluate our proposed designs using multiple DL models on state-of-the-art HPC clusters. Our experimental results show that the proposed designs deliver up to 15% improvement in overall DL training time. To the best of our knowledge, this is the first work to explore the use of DPUs to accelerate DL training.

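The core idea — keeping the GPU busy with training while an independent phase (e.g., validation of the previous epoch's model) runs concurrently on the DPU's ARM cores — can be illustrated with a small Python sketch. This is not the authors' implementation: a local process pool stands in for the DPU-hosted worker, and all function names are hypothetical placeholders.

```python
# Illustrative sketch only: overlap an offloaded phase (validation of the
# previous epoch's checkpoint) with the next training epoch, so the GPU is
# never blocked waiting on validation. In the paper's setting the worker
# would run on the BlueField-2 DPU's ARM cores; here a local process pool
# is used as a stand-in. train_one_epoch() and validate() are placeholders.
from concurrent.futures import ProcessPoolExecutor


def train_one_epoch(epoch):
    # GPU-side training for one epoch (placeholder); returns a checkpoint.
    return {"epoch": epoch, "weights": f"ckpt-{epoch}"}


def validate(checkpoint):
    # Offloaded validation of a model snapshot (placeholder).
    return {"epoch": checkpoint["epoch"], "val_acc": 0.0}


def train(num_epochs=5):
    pending = None
    with ProcessPoolExecutor(max_workers=1) as dpu_worker:
        for epoch in range(num_epochs):
            ckpt = train_one_epoch(epoch)  # training proceeds on the GPU
            if pending is not None:
                print("validation result:", pending.result())
            # Hand the snapshot to the offload worker; the next training
            # epoch starts without waiting for validation to finish.
            pending = dpu_worker.submit(validate, ckpt)
        if pending is not None:
            print("validation result:", pending.result())


if __name__ == "__main__":
    train()
```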