Efficient low-latency packet processing using On-GPU Thread-Data Remapping

Huanxin Lin, Cho-Li Wang

doi:10.1016/j.jpdc.2019.06.009

Abstract

Graphics processing units (GPUs) are widely used to accelerate packet processing in both physical and virtual networks. However, real-life packets vary greatly in size, causing severe GPU control flow divergence. Previous solutions rely on CPU preprocessing to reduce divergence, but this precludes the more efficient NIC–GPU packet streaming, because packet batches must stop completely at the host machine. To fully utilize both GPU and PCIe resources, we propose Blink, a modular GPU software router. Instead of CPU preprocessing, the Blink router uses On-GPU Thread-Data Remapping to reduce divergence, and our novel Cross-Iteration Thread Event Signaling mechanism filters out unnecessary inter-thread synchronization, doubling the performance gain achieved by traditional solutions. Serving as a TCP/IP router with a Deep Packet Inspection (DPI) firewall, Blink can sustain a processing throughput of 31.5 Gbit/s over a PCIe bandwidth of 32 Gbit/s. At a given bandwidth, Blink reduces processing latency by at least half compared with other works.
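
To illustrate why divergent packet sizes are costly and what thread-data remapping aims to achieve, the following is a minimal CPU-side sketch in Python, not Blink's actual algorithm. It assumes a simplified cost model in which every thread of a 32-thread warp stays busy until the largest packet in that warp is finished, and it uses a plain sort by size as a stand-in for the remapping step. The function names, the cost model, and the packet sizes are all hypothetical.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def warp_cost(sizes, order):
    """Simplified lock-step cost: each warp runs as long as its largest packet.

    `order[t]` is the index of the packet assigned to thread t.
    """
    cost = 0
    for start in range(0, len(order), WARP_SIZE):
        warp = order[start:start + WARP_SIZE]
        cost += max(sizes[i] for i in warp) * len(warp)
    return cost


def remap_by_size(sizes):
    """Hypothetical remapping: thread t handles the t-th smallest packet."""
    return sorted(range(len(sizes)), key=lambda i: sizes[i])


# Made-up batch: small and large packets interleaved in arrival order.
sizes = [64 if i % 2 == 0 else 1500 for i in range(64)]
identity = list(range(len(sizes)))   # thread t handles packet t
remapped = remap_by_size(sizes)      # similar sizes share a warp

baseline = warp_cost(sizes, identity)
improved = warp_cost(sizes, remapped)

print(baseline)  # 96000: both warps are held up by 1500-byte packets
print(improved)  # 50048: equals sum(sizes), so no lanes sit idle
```

In this toy batch, grouping similar-sized packets into the same warp removes all idle lanes. On a real GPU the remapping itself costs synchronization between threads, which is the overhead that the abstract says Cross-Iteration Thread Event Signaling is designed to reduce.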

Full Text