A High-Throughput, Resource-Efficient Implementation of the RoCEv2 Remote DMA Protocol and its Application

Niklas Schelten,Benno Stabernack,Fritjof Steinert,Anton Schulte,Justin Knapheide

doi:10.1145/3543176

Niklas Schelten, Benno Stabernack + Show 3 more

Open Access

PDF Available

https://doi.org/10.1145/3543176

Copy DOI

Export

Save

Cite

Abstract
Full-Text PDF
Similar Papers

Abstract

Listen

The use of application-specific accelerators in data centers has been the state of the art for at least a decade, starting with the availability of General Purpose GPUs achieving higher performance either overall or per watt. In most cases, these accelerators are coupled via PCIe interfaces to the corresponding hosts, which leads to disadvantages in interoperability, scalability and power consumption. As a viable alternative to PCIe-attached FPGA accelerators this paper proposes standalone FPGAs as Network-attached Accelerators (NAAs) . To enable reliable communication for decoupled FPGAs we present an RDMA over Converged Ethernet v2 (RoCEv2) communication stack for high-speed and low-latency data transfer integrated into a hardware framework. For NAAs to be used instead of PCIe coupled FPGAs the framework must provide similar throughput and latency with low resource usage. We show that our RoCEv2 stack is capable of achieving 100 Gb/s throughput with latencies of less than 4μs while using about 10% of the available resources on a mid-range FPGA. To evaluate the energy efficiency of our NAA architecture, we built a demonstrator with 8 NAAs for machine learning based image classification. Based on our measurements, network-attached FPGAs are a great alternative to the more energy-demanding PCIe-attached FPGA accelerators.

Full Text