Abstract

In recent years, many companies have developed distributed computation frameworks for processing machine learning (ML) jobs in clusters. Networking is a well-known bottleneck for ML systems, and clusters demand efficient scheduling of the heavy traffic (up to 1 GB per flow) generated by ML jobs. Coflow has proven to be an effective abstraction for scheduling the flows of such data-parallel applications. However, coflow scheduling policies are difficult to implement when coflow characteristics are unknown a priori, and when TCP congestion control misinterprets congestion signals and delivers low throughput. Fortunately, the traffic patterns of some ML jobs make it possible to speculate complete coflow characteristics from limited information. This paper therefore characterizes the coflows of such ML jobs as self-similar coflows and proposes Cicada, a decentralized self-similar coflow scheduler. Cicada assigns each coflow a probe flow to speculate its characteristics during transmission and employs Shortest Job First (SJF) to separate coflows into strict priority queues based on the speculation result. To achieve full bandwidth for throughput-sensitive ML jobs and to guarantee enforcement of the scheduling policy, Cicada introduces an elastic transport-layer rate control that outperforms prior designs. Large-scale simulations show that Cicada completes coflows 2.08x faster than state-of-the-art schemes in the information-agnostic scenario.
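To make the SJF-based queueing concrete, the sketch below shows one way a scheduler could map a coflow's speculated size to a strict priority queue: smaller speculated coflows land in higher-priority queues, approximating SJF ordering. The thresholds, queue count, and function name are illustrative assumptions, not details taken from the paper.

    import bisect

    # Hypothetical queue thresholds in bytes; the specific values are
    # illustrative assumptions, not parameters from the paper.
    QUEUE_THRESHOLDS = [10 * 2**20, 100 * 2**20, 2**30]  # 10 MiB, 100 MiB, 1 GiB

    def assign_priority_queue(speculated_size: int) -> int:
        """Map a coflow's speculated size to a strict priority queue.

        Queue 0 is the highest priority; coflows with larger speculated
        sizes are demoted to lower-priority queues.
        """
        return bisect.bisect_right(QUEUE_THRESHOLDS, speculated_size)

    # Example: a coflow speculated at 50 MiB falls into queue 1.
    print(assign_priority_queue(50 * 2**20))  # -> 1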
