Scalable Fully Pipelined Hardware Architecture for In-Network Aggregated AllReduce Communication

Yao Liu, Wangchen Dai, Junyi Zhang, Qiaoling Wang, Ray Chak Chung Cheung, Shuo Liu

doi:10.1109/tcsi.2021.3098841

Abstract

The Ring-AllReduce framework is currently the most popular solution for deploying industry-level distributed machine learning tasks. However, it achieves only about half of the maximum bandwidth even under optimal conditions. In recent years, several in-network aggregation frameworks have been proposed to overcome this drawback, but limited hardware information has been disclosed. In this paper, we propose a scalable fully-pipelined architecture that handles tasks such as forwarding, aggregation and retransmission with no bandwidth loss. The architecture is implemented on a Xilinx UltraScale FPGA that connects to 8 working servers with 10 Gb/s network adapters, and it is able to scale to more complicated scenarios involving more workers. Compared with Ring-AllReduce, AllReduce-Switch improves the effective bandwidth of AllReduce communication by a factor of $1.75\times$. In image training tasks, the proposed hardware architecture achieves up to a $1.67\times$ speedup of the training process. For computing-intensive models, the speedup from communication may be partially hidden by computation. In particular, for ResNet-50, AllReduce-Switch speeds up the training process with MPI and NCCL by $1.30\times$ and $1.04\times$, respectively.
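The "about half of the maximum bandwidth" claim can be illustrated with the standard bandwidth-only cost model for AllReduce. The sketch below is an assumption-laden illustration, not the paper's derivation: it ignores latency, packet headers and retransmission, and the function names are hypothetical. Under this model each Ring-AllReduce worker transmits 2(N-1)/N times the payload, whereas with in-network aggregation each worker transmits the payload once. For N = 8 the ideal ratio is 1.75, which is consistent with the figure reported above, although the abstract does not state that the figure was obtained this way.

```python
def ring_allreduce_efficiency(n_workers: int) -> float:
    """Fraction of link bandwidth that is effective under Ring-AllReduce.

    Bandwidth-only model (assumption): each worker sends 2*(N-1)/N times
    the payload size, so the effective bandwidth is N / (2*(N-1)) of the link.
    """
    if n_workers < 2:
        raise ValueError("AllReduce needs at least two workers")
    return n_workers / (2 * (n_workers - 1))


def ideal_in_network_speedup(n_workers: int) -> float:
    """Ideal bandwidth gain of in-network aggregation over Ring-AllReduce.

    Assumes the switch aggregates at line rate, so each worker sends the
    payload exactly once (efficiency 1.0). Gain = 2*(N-1)/N.
    """
    if n_workers < 2:
        raise ValueError("AllReduce needs at least two workers")
    return 2 * (n_workers - 1) / n_workers


if __name__ == "__main__":
    for n in (2, 4, 8, 64):
        eff = ring_allreduce_efficiency(n)
        gain = ideal_in_network_speedup(n)
        print(f"N={n:3d}  ring efficiency={eff:.3f}  ideal gain={gain:.3f}x")
```

As N grows, the ring efficiency approaches 0.5 and the ideal gain approaches 2, which is why the benefit of in-network aggregation is expected to increase with the number of workers under this model.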

Journal: IEEE Transactions on Circuits and Systems I: Regular Papers	Publication Date: Oct 1, 2021
Citations: 7	License type: publisher-specific-oa
