Abstract

Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that exploit heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.

