Abstract

Modern FPGA accelerators can be equipped with many high-bandwidth network I/Os, e.g., 64 × 50 Gbps, enabled by onboard optics or co-packaged optics. Some dozens of tightly coupled FPGA accelerators form an emerging computing platform for distributed data processing. However, a conventional indirect packet network built from Ethernet intellectual property cores imposes an unacceptably large amount of logic on an FPGA for handling such high-bandwidth interconnects. The alternative is a direct packet network, but existing direct inter-FPGA networks use a low-radix topology, e.g., a 2-D torus, whose large diameter and large average shortest path length increase the latency of collectives. To mitigate both problems, we propose a lightweight, fully connected inter-FPGA network called OPTWEB for efficient collectives. Since all end-to-end communication paths are statically established as separate optical links using onboard optics, raw block data can be transferred with simple link-level synchronization. Once each source FPGA assigns a communication stream to a path through its internal switch logic between memory-mapped and stream interfaces for remote direct memory access (RDMA), a one-hop transfer is provided. Since each FPGA performs remote memory access input/output with all FPGAs simultaneously, multiple RDMAs efficiently form collectives. The OPTWEB network provides 0.71-μsec start-up latency for collectives among multiple Intel Stratix 10 MX FPGA cards with onboard optics, and it consumes 31.4 and 57.7 percent of the adaptive logic modules of a custom Stratix 10 MX 2100 FPGA for aggregate 400-Gbps and 800-Gbps interconnects, respectively. The OPTWEB network reduces cost by 40 percent compared to a conventional packet network.
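The abstract's key idea is that, with a dedicated one-hop path from every FPGA to every other FPGA, a collective such as all-gather needs no multi-hop forwarding: every node's RDMA write lands directly in a known slot of every peer's buffer, and all writes proceed simultaneously. The following minimal sketch (not the authors' code; function and buffer names are illustrative) models that single-step behavior in software:

```python
# Sketch: how simultaneous one-hop RDMA writes over a fully connected
# network realize an all-gather in a single communication step.
# Node i has a dedicated path to every peer, so no forwarding occurs.

def all_gather_fully_connected(local_blocks):
    """local_blocks[i] is node i's block; returns each node's gathered view."""
    n = len(local_blocks)
    # Each node exposes one receive slot per peer, as in RDMA where the
    # writer targets a known remote offset.
    recv = [[None] * n for _ in range(n)]
    for src in range(n):          # in hardware, all sources act at once
        for dst in range(n):
            recv[dst][src] = local_blocks[src]   # one-hop write
    return recv

blocks = ["b0", "b1", "b2", "b3"]
gathered = all_gather_fully_connected(blocks)
assert all(view == blocks for view in gathered)
```

In a multi-hop topology the same collective would require several forwarding rounds; here the step count is independent of the node count, which is what keeps the start-up latency low.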

Highlights

  • Parallel data processing using multiple field-programmable gate array (FPGA) accelerators with high-bandwidth memory, e.g., HBM2, and a high-bandwidth network becomes a way to compute emerging parallel applications, including deep neural networks [1] or columnar database [2]

  • In Japan, we have developed onboard optics using Optical I/O Core for a 100-Gbps transceiver (4×25 Gbps), and we have integrated these optics into a custom Intel Stratix 10 FPGA card [19]

  • In the prototype accelerator system, four ports are used for four FPGAs

Summary

Introduction

Parallel data processing using multiple field-programmable gate array (FPGA) accelerators with high-bandwidth memory, e.g., HBM2, and a high-bandwidth network becomes a way to compute emerging parallel applications, including deep neural networks [1] or columnar database [2]. For throughput data processing, e.g., sorting operation [3], [4], some FPGA cards have network ports [5]. The inter-FPGA communication start-up latency typically reaches tens of μsec even on a small traditional system [29], [30]. Direct interconnection networks are attempted on an FPGA-accelerator system, e.g., Project Catapult v1 [6]. In current Ethernet switches, electric SERDES conversion consumes significant power, and the broad area of the aggregate I/O pluggable ports increases the onboard wire length. To mitigate both problems, the optical technology should be tightly coupled with a switch ASIC. Onboard optics are needed to support a switch ASIC of up to 40 Tbps [22], and co-packaged optics (CPO) will commercially mature before 51.2-Tbps switch ASICs are deployed
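The motivation for a fully connected topology is the diameter argument stated above: a low-radix k×k 2-D torus has diameter 2·⌊k/2⌋, while a fully connected network reaches any peer in one hop. A small sketch (illustrative, not from the paper) makes the comparison concrete by computing both diameters with BFS on the hop graph:

```python
# Sketch: diameter of a 4x4 2-D torus vs. a fully connected network
# of the same size, computed by breadth-first search over hops.
from collections import deque

def diameter(nodes, neighbors):
    best = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

k = 4  # a 4x4 torus of 16 FPGAs
torus_nodes = [(x, y) for x in range(k) for y in range(k)]
def torus_nbrs(p):
    x, y = p
    return [((x+1) % k, y), ((x-1) % k, y), (x, (y+1) % k), (x, (y-1) % k)]

full_nodes = list(range(k * k))
def full_nbrs(u):
    return [v for v in full_nodes if v != u]

print(diameter(torus_nodes, torus_nbrs))  # 4 hops for the 4x4 torus
print(diameter(full_nodes, full_nbrs))    # 1 hop for the fully connected case
```

The torus diameter grows with the node count while the fully connected diameter stays at one, which is why the latter favors low-latency collectives at the cost of per-node link count.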

