Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR

Quentin Anthony, Arpan Jain, Hari Subramoni, Ammar Ahmad Awan, Dhabaleswar K. (DK) Panda

doi:10.1109/ipdpsw50202.2020.00171

Abstract

Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high-resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU when training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second when training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to use supercomputers to speed up the training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for it. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted in-depth performance tuning of various Horovod/MPI knobs to improve on the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR on up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved a mean Intersection over Union (mIOU) accuracy of 80.8% for distributed training, which is on par with the published accuracy for this model. Further, we demonstrate a 23.9% improvement in scaling efficiency over default Horovod training, which translates to a 1.3× speedup in training performance.
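
To make the scaling figures above concrete, the following minimal Python sketch shows how scaling efficiency relates single-GPU throughput to aggregate throughput. It assumes that the 92% figure is measured at 132 GPUs against the 6.7 images/second single-GPU baseline; the abstract does not state this explicitly, so the implied aggregate throughput is an illustration derived under that assumption, not a reported result. The function and constant names are hypothetical.

```python
def ideal_throughput(single_gpu_rate: float, n_gpus: int) -> float:
    """Aggregate images/second under perfect linear scaling."""
    return single_gpu_rate * n_gpus


def scaling_efficiency(measured_rate: float, single_gpu_rate: float, n_gpus: int) -> float:
    """Fraction of the ideal linear-scaling throughput actually achieved."""
    return measured_rate / ideal_throughput(single_gpu_rate, n_gpus)


# Figures quoted in the abstract.
SINGLE_GPU_RATE = 6.7    # images/second for DLv3+ on one Volta GPU
N_GPUS = 132             # largest scale reported
TUNED_EFFICIENCY = 0.92  # near-linear scaling with MVAPICH2-GDR

# Assumption: the 92% efficiency applies at 132 GPUs relative to the
# 6.7 images/second baseline. The value below is derived, not reported.
ideal = ideal_throughput(SINGLE_GPU_RATE, N_GPUS)
implied_rate = TUNED_EFFICIENCY * ideal

print(f"ideal throughput at {N_GPUS} GPUs:   {ideal:.1f} images/second")
print(f"implied throughput at 92% scaling: {implied_rate:.1f} images/second")
```

Given a measured aggregate rate from a real run, `scaling_efficiency` returns the corresponding efficiency, which allows default and tuned Horovod configurations to be compared on the same footing.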

Full Text