In this paper we study the relationship between the TCP packet loss cycle and the performance of time-sensitive traffic in data centers. Using real traffic measurements and analysis, we find that such loss cycles are not long enough for most partition-aggregate, time-sensitive TCP applications to recover their packet losses via TCP's three-duplicate-ACK (fast retransmit) mechanism. As a result, the retransmission timeout (RTO) mechanism is frequently triggered, inflating the flow completion times (FCT) of such applications by orders of magnitude. Hence, we seek an alternative method that requires no changes to the virtual machines and that effectively lengthens the loss cycle so that short flows can finish their transfers without incurring the cost of an RTO. To this end, we propose a novel TCP-AQM mechanism that alternates between a slow constant bitrate (CBR) mode and a fast TCP rate via hysteresis switching to lengthen the loss cycle. We prove the stability of the proposed TCP-AQM via a control-theoretic model, then evaluate its performance gains via small- and large-scale NS2 simulations and a hardware prototype implemented on the NetFPGA platform. The results show considerable improvements in the FCT distribution and a reduction in missed deadlines in both simulations and real experiments.
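The hysteresis switching between the slow CBR mode and the fast TCP mode can be illustrated with a minimal sketch. The abstract does not state which signal drives the switch, so this example assumes, purely for illustration, that the switch reacts to queue occupancy with two hypothetical thresholds (`low` and `high`). The class and function names are likewise hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

FAST_TCP = "tcp"  # senders run at their normal TCP rate
SLOW_CBR = "cbr"  # senders are held to a slow constant bitrate


@dataclass
class HysteresisSwitch:
    """Two-threshold mode switch (illustrative assumption, not the paper's design)."""
    low: int               # hypothetical lower queue threshold, in packets
    high: int              # hypothetical upper queue threshold, in packets
    mode: str = FAST_TCP   # start in the fast TCP mode

    def __post_init__(self):
        if not self.low < self.high:
            raise ValueError("low must be strictly less than high")

    def update(self, queue_len: int) -> str:
        """Return the mode to use given the current queue length."""
        if self.mode == FAST_TCP and queue_len >= self.high:
            # Queue is filling up: fall back to the slow CBR mode.
            self.mode = SLOW_CBR
        elif self.mode == SLOW_CBR and queue_len <= self.low:
            # Queue has drained: resume the fast TCP mode.
            self.mode = FAST_TCP
        # Between the thresholds the current mode is kept (the hysteresis band).
        return self.mode


def trace_modes(queue_samples, low, high):
    """Run the switch over a sequence of queue-length samples."""
    switch = HysteresisSwitch(low=low, high=high)
    return [switch.update(q) for q in queue_samples]


if __name__ == "__main__":
    print(trace_modes([10, 60, 90, 60, 30, 15, 60], low=20, high=80))
```

Because the mode changes only when a threshold is crossed, a queue length inside the band (60 in the sample trace) maps to either mode depending on recent history. This keeps the controller from oscillating rapidly around a single set point.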