Error Recovery of RDMA Packets in Data Center Networks

Yi Wang, Bo Bai, Kexin Liu, Chen Tian, Gong Zhang

doi:10.1109/icccn.2019.8846946

Abstract

Modern data center applications need high throughput (40 Gbps) and ultra-low latency (<10 µs per hop), along with low CPU overhead. Remote Direct Memory Access (RDMA), which can be deployed over Ethernet through the RDMA over commodity Ethernet (RoCEv2) protocol, has the potential to satisfy these requirements. RoCEv2 needs a lossless network to achieve high performance, and it uses Priority-based Flow Control (PFC) to prevent packet loss caused by buffer overflow. However, packet loss can still occur in today's data centers for other reasons, such as switch configuration errors. Two retransmission algorithms handle packet loss recovery: Go-Back-0 and Go-Back-N. Unfortunately, when the Go-Back-N algorithm is simply applied to RoCEv2, the relative throughput drops to nearly zero once the packet loss rate exceeds 1%. The main cause is an improper mechanism for triggering NAK generation. This paper proposes an Improved Go-Back-N algorithm, built on two mechanisms, to solve this problem. Improved Go-Back-N is easy to deploy in today's data centers because it requires no changes to switches. It raises the relative throughput to about 60% when the packet loss rate increases to 1%.
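To make the difference between the two baseline recovery schemes concrete, the following is a minimal sketch that counts how many transmissions a sender needs to deliver one message under Go-Back-0 (restart the whole message after a loss) and Go-Back-N (resume from the lost packet). It is an illustration under simplifying assumptions, not the paper's simulator and not the Improved Go-Back-N algorithm, whose two mechanisms the abstract does not describe. The function name `transmissions_needed`, the parameter `nak_delay` (packets already in flight when the NAK reaches the sender), and the idea of specifying losses by transmission index are all hypothetical choices made for this example.

```python
def transmissions_needed(message_len, dropped_attempts, nak_delay, go_back_zero):
    """Count transmissions until message_len packets arrive in order.

    Assumptions (illustrative only):
      - dropped_attempts holds the 0-based indices of transmissions that are lost.
      - Every loss is detected, and the sender has pushed up to nak_delay
        further packets before the NAK arrives; the receiver discards them.
      - go_back_zero=True restarts from packet 0, False resumes at the lost packet.
    """
    dropped = set(dropped_attempts)
    attempts = 0  # total packets put on the wire so far
    psn = 0       # next in-order packet sequence number to deliver
    while psn < message_len:
        lost = attempts in dropped
        attempts += 1
        if not lost:
            psn += 1
            continue
        # Packets sent after the lost one, before the NAK arrives, are wasted.
        attempts += min(nak_delay, message_len - psn - 1)
        if go_back_zero:
            psn = 0  # Go-Back-0: retransmit the entire message
        # Go-Back-N: psn is unchanged, so sending resumes at the lost packet
    return attempts


def relative_throughput(message_len, attempts):
    """Useful packets delivered divided by packets actually transmitted."""
    return message_len / attempts


if __name__ == "__main__":
    for name, gb0 in (("Go-Back-N", False), ("Go-Back-0", True)):
        n = transmissions_needed(10, {5}, nak_delay=2, go_back_zero=gb0)
        print(name, n, round(relative_throughput(10, n), 2))
```

With a 10-packet message, a single loss on the sixth transmission, and two packets in flight, the sketch charges Go-Back-N for the lost packet plus the discarded in-flight packets, while Go-Back-0 additionally resends the five packets that had already been delivered. The gap widens as messages get longer, since each loss under Go-Back-0 costs the whole delivered prefix.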

Full Text