Abstract
Communication in data-parallel frameworks is highly structured. Coflow is a networking abstraction proposed to capture their all-or-nothing, job-specific communication semantics. Minimizing coflow completion time (CCT) decreases the completion time of the corresponding jobs. However, state-of-the-art coflow scheduling approaches suffer from several drawbacks. On the one hand, both sender-driven and receiver-driven scheduling approaches fail to achieve optimal performance, especially when a bandwidth bottleneck exists. On the other hand, they fail to optimize the number of concurrent connections, even though too many or too few concurrent connections can prolong the CCT. In this paper, we propose Django, a bilateral coflow scheduling framework. We first use a Support Vector Machine (SVM) as the machine learning model to automatically identify the optimal number of concurrent connections, i.e., the queue limitation in the switch. Based on the predicted results, we further develop a set of scalable distributed coflow scheduling algorithms. Testbed experiments and trace-driven simulations show that Django estimates the number of concurrent connections with an accuracy of 98% and reduces the average CCT and 95th percentile CCT by 15% and 40%, respectively.