Abstract

To explore the potential of distributed training of deep neural networks, we implement several distributed algorithms on top of swFlow on the world-leading supercomputer, Sunway TaihuLight. Starting from two naive designs, parameter server and ring all-reduce, we expose the limitations of each communication model and discuss optimizations that adapt them to the five-level interconnect architecture of the Sunway system. To reduce the communication bottleneck on large-scale systems, we introduce multi-server and hierarchical ring all-reduce models. On a benchmark drawn from a deep learning-based cancerous region detection algorithm, the average parallel efficiency exceeds 80% on up to 1,024 processors. This reveals a great opportunity for combining deep learning with HPC systems.
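For intuition, below is a minimal sketch of the hierarchical all-reduce idea (reduce within a node group, all-reduce among group leaders, broadcast back within each group), written with mpi4py. The group size PROCS_PER_NODE and the use of MPI's built-in Allreduce as a stand-in for the inter-node ring stage are illustrative assumptions, not the paper's swFlow implementation; on Sunway, the grouping would follow the five-level interconnect hierarchy.

```python
# Hypothetical sketch of hierarchical all-reduce; not the swFlow code.
from mpi4py import MPI
import numpy as np

PROCS_PER_NODE = 4  # assumed group size for illustration


def hierarchical_allreduce(grad: np.ndarray) -> np.ndarray:
    world = MPI.COMM_WORLD
    # Group consecutive ranks into "nodes" of PROCS_PER_NODE processes.
    node = world.Split(color=world.rank // PROCS_PER_NODE, key=world.rank)

    # Step 1: reduce gradients to each node's local root (rank 0 in the group).
    local_sum = np.empty_like(grad) if node.rank == 0 else None
    node.Reduce(grad, local_sum, op=MPI.SUM, root=0)

    # Step 2: all-reduce among node roots only; the MPI library may or may
    # not use a ring algorithm internally for this stage.
    roots = world.Split(color=0 if node.rank == 0 else MPI.UNDEFINED,
                        key=world.rank)
    total = np.empty_like(grad)
    if node.rank == 0:
        roots.Allreduce(local_sum, total, op=MPI.SUM)
        roots.Free()

    # Step 3: broadcast the global sum back within each node.
    node.Bcast(total, root=0)
    node.Free()
    return total


if __name__ == "__main__":
    rank = MPI.COMM_WORLD.rank
    g = np.full(4, float(rank))  # toy "gradient" per rank
    print(rank, hierarchical_allreduce(g))
```

Launched with, e.g., `mpirun -n 8 python hier_allreduce.py`, each rank contributes its gradient and receives the global sum; the point of the hierarchy is that only one process per node participates in the expensive inter-node stage.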
