With the increasing size of training datasets and models, the bottleneck in distributed machine learning (DML) training has shifted from computation to communication. To address this bottleneck, we propose an all-optical switching network architecture for accelerating the communication phase of DML training. Experimental results validate error-free packet transmission and low-latency server-to-server communication of 385 ns at a traffic load of 0.9. A small-scale DML training experiment deployed on the proposed architecture shows that ResNet50, ResNet101, and VGG19 training can be accelerated by 1.16x to 1.48x compared to an electrical switching network. The proposed architecture also demonstrates a 58.9% enhancement in cost efficiency and a 60.9% improvement in power efficiency compared to the 3-tier fat-tree architecture.