Abstract

Remote Direct Memory Access (RDMA) suffers from unfairness and performance degradation when multiple applications share RDMA network resources. Hence, an efficient resource scheduling mechanism is needed to optimally allocate RDMA resources among applications. However, traditional Network Utility Maximization (NUM) based solutions are inadequate for RDMA due to three challenges: 1) the standard NUM-oriented algorithm cannot handle the coupled variables introduced by multiple dependent RDMA operations; 2) the stringent constraints on RDMA on-board resources complicate the standard NUM by adding extra optimization dimensions; 3) naively applying traditional NUM algorithms suffers from scalability issues when solving a large-scale RDMA resource scheduling problem. In this paper, we present how to optimally share RDMA resources in large-scale data center networks in a distributed manner. First, we propose Distributed RDMA NUM (DRUM), which models the RDMA resource scheduling problem as a new variant of the NUM problem. Second, we present distributed algorithms that efficiently solve the large-scale, interdependent RDMA resource sharing problem for different RDMA use cases. Through theoretical analysis, we guarantee the convergence and parallelism of the proposed algorithms. Finally, we implement the algorithms as a kernel-level indirection module in a real-world RDMA environment to provide end-to-end resource sharing and performance guarantees. Through extensive evaluation with large-scale simulations and testbed experiments, we show that our method significantly improves application performance under resource contention, achieving $1.7$–$3.1\times$ higher throughput; in a dynamic context, the largest improvements reach $98.1\%$ in latency and $64.1\%$ in throughput.
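For context, a minimal sketch of the classical NUM formulation that DRUM builds on (the notation below is the standard one from the congestion-control literature, e.g., Kelly's framework, and is not this paper's DRUM model): each flow $s$ with rate $x_s$ and concave utility $U_s$ jointly solves

\[
\max_{x \ge 0} \; \sum_{s} U_s(x_s) \quad \text{s.t.} \quad \sum_{s:\, l \in L(s)} x_s \le c_l \quad \forall l,
\]

where $L(s)$ is the set of links traversed by flow $s$ and $c_l$ is the capacity of link $l$. DRUM departs from this template by coupling the rate variables of dependent RDMA operations and by adding constraints for scarce NIC on-board resources, which is what motivates the distributed algorithms above.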
