As RDMA networks have been built out for data center applications, DCQCN, which is tightly coupled with RDMA, has come to dominate RDMA Congestion Control (CC). However, DCQCN suffers severe performance problems in high-speed RDMA networks running modern high-performance distributed applications such as machine learning training. This paper presents RECC, motivated by both the emerging programmability of RDMA NICs (RNICs) and the limitations of existing RDMA congestion control mechanisms. RECC jointly leverages RTT and ECN events from RNICs to respond to congestion promptly and precisely, and adds a History-aware Burst Smooth mechanism to avoid wrong rate decisions under various traffic patterns. We implement RECC entirely on commercial RNICs, without any modifications to switches, the RDMA protocol stack, or applications. Microbenchmark testbed experiments and real Machine Learning (ML) workload experiments with hundreds of 200G RNICs show that, compared with DCQCN, RECC reduces network tail latency by up to 64.4% and pause duration by up to 95%. In addition, large-scale simulations with realistic workloads demonstrate that RECC achieves performance comparable to HPCC.