Optimizing Distributed DNN Training Using CPUs and BlueField-2 DPUs

Arpan Jain,Aamir Shafi,Hari Subramoni,Nawras Alnaasan,Dhabaleswar K Panda

doi:10.1109/mm.2021.3139027

Abstract

The deep learning (DL) training process consists of multiple phases—data augmentation, training, and validation of the trained model. Traditionally, these phases are executed either on the central processing units or graphics processing units in a serial fashion due to lack of additional computing resources to offload independent phases of DL training. Recently, Mellanox/NVIDIA introduced the BlueField-2 data processing units (DPUs), which combine the advanced capabilities of traditional application-specific-integrated-circuit-based network adapters with an array of ARM processors. In this article, we explore how to take advantage of the additional ARM cores on the BlueField-2 DPUs. We propose and evaluate multiple novel designs to efficiently offload the phases of DL training to the DPUs. Our experimental results show that the proposed designs are able to deliver up to 17.5% improvement in overall DL training time. To the best of our knowledge, this is the first work to explore the use of DPUs to accelerate DL training.

Full Text