Abstract
This article proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers, which focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, mainly due to the communication inefficiency they incur. DisCo generates jointly optimized computation operator and communication tensor fusion strategies to enable highly efficient distributed training. A GNN-based simulator is built to estimate the per-iteration training time achieved by candidate operator/tensor fusion strategies. Driven by the simulator, a backtracking search algorithm navigates the large strategy space efficiently to identify operator/tensor fusion strategies that minimize distributed training time. We compare DisCo with existing DL fusion schemes and show that it achieves training speed-ups close to the ideal case of full computation-communication overlap.
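The abstract's search procedure can be pictured as a simulator-guided backtracking search over fusion decisions. The following is a minimal conceptual sketch only, not DisCo's actual implementation: the names FusionStrategy, simulate_iteration_time, and candidate_fusion_moves are hypothetical placeholders, and the GNN-based simulator is stubbed out with a dummy estimate.

```python
from typing import List, Optional


class FusionStrategy:
    """Hypothetical container for a partial operator/tensor fusion plan."""

    def __init__(self, decisions: Optional[List[str]] = None):
        self.decisions = decisions or []

    def extend(self, move: str) -> "FusionStrategy":
        return FusionStrategy(self.decisions + [move])


def simulate_iteration_time(strategy: FusionStrategy) -> float:
    """Placeholder for the GNN-based simulator: returns an estimated
    per-iteration distributed training time for a (partial) strategy."""
    # In the paper a trained GNN predicts this; here it is a dummy value.
    return float(len(strategy.decisions))


def candidate_fusion_moves(strategy: FusionStrategy) -> List[str]:
    """Placeholder enumerating the next operator/tensor fusion decisions.
    Returning an empty list marks the strategy as complete in this sketch."""
    return []


def backtracking_search(strategy: FusionStrategy, best: dict) -> None:
    """Depth-first backtracking with pruning: abandon a branch as soon as
    the simulated time already exceeds the best complete strategy found.
    (Sound pruning assumes the estimate lower-bounds completions of the
    partial plan; that assumption is a simplification in this sketch.)"""
    estimate = simulate_iteration_time(strategy)
    if estimate >= best["time"]:
        return  # prune: this branch cannot beat the current best
    moves = candidate_fusion_moves(strategy)
    if not moves:  # complete strategy: record it as the new best
        best["time"], best["strategy"] = estimate, strategy
        return
    for move in moves:
        backtracking_search(strategy.extend(move), best)


best = {"time": float("inf"), "strategy": None}
backtracking_search(FusionStrategy(), best)
```

The design point the sketch illustrates is that the simulator serves as a cheap cost model, so many fusion candidates can be evaluated and pruned without actually running distributed training.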