Abstract

In distributed deep learning (DL), collective communication algorithms such as Allreduce, which are used to share training results among graphics processing units (GPUs), are an unavoidable bottleneck. We hypothesize that the cache access latency incurred at every Allreduce is a significant bottleneck in current computing systems with high-bandwidth interconnects for distributed DL. To reduce how often this latency is incurred, it is important to aggregate data at the network interfaces. We implement a data aggregation circuit in a field-programmable gate array (FPGA) and, using this FPGA, propose a novel Allreduce architecture and training strategy that do not degrade accuracy. Measurement results show that the Allreduce latency is reduced to one quarter. Our system also conceals about 90% of the communication overhead and improves scalability by 20%. The end-to-end training time of distributed DL with ResNet-50 on ImageNet is reduced to 87.3% without any degradation in validation accuracy.
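
For context, Allreduce gives every worker the element-wise reduction (typically a sum or average) of all workers' gradient buffers. The following is a minimal conceptual sketch of that semantics only, not the paper's FPGA-based architecture; the function name and the 4-worker example are illustrative assumptions.

```python
import numpy as np

def allreduce_sum(worker_grads):
    """Conceptual Allreduce: every worker ends up holding the
    element-wise sum of all workers' gradient buffers."""
    total = np.sum(worker_grads, axis=0)          # reduce (aggregate)
    return [total.copy() for _ in worker_grads]   # distribute result back

# Hypothetical example: 4 workers, each holding a small gradient buffer.
grads = [np.random.randn(8).astype(np.float32) for _ in range(4)]
reduced = allreduce_sum(grads)
assert all(np.allclose(r, reduced[0]) for r in reduced)  # all workers agree
```

In a real cluster this reduction is performed by a communication library over the interconnect; the paper's approach moves the aggregation step into the network interface (FPGA) so that the per-Allreduce cache access latency on the hosts is incurred less often.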
