Abstract

Deep learning (DL) is an increasingly important tool for large-scale data analytics, and DL workloads are common in today's production clusters due to the growing number of deep-learning-driven services (e.g., online search and speech recognition). To handle ever-growing training datasets, it is common to conduct distributed DL (DDL) training across multiple machines in parallel. Training DL models in parallel can incur significant bandwidth contention on shared clusters; as a result, the network is a well-known bottleneck for distributed training, and efficient network scheduling is essential for maximizing training performance. DL training is a feedback-driven exploration process (e.g., hyper-parameter tuning and model structure optimization) that requires multiple retrainings of models that differ in their configurations. Information from the early stage of each retraining can guide the search toward high-quality models; thus, reducing the early-stage time can accelerate this exploration. In this paper, we propose JPAS, a flow scheduling system for DDL training jobs that aims to reduce the early-stage time. JPAS uses a simple greedy mechanism to periodically order all DDL jobs. Each host machine sets priorities for its flows according to the job order and offloads flow scheduling and rate allocation to the underlying priority-enabled network. We evaluate JPAS on a real testbed composed of 13 servers and a commodity switch. The results demonstrate that JPAS reduces the time to reach 90% or 95% of the converged accuracy by up to 38%. Hence, JPAS can remarkably shorten the early-stage time and thus accelerate the search for high-quality models.
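The priority-assignment idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the job list, the `urgency` score used for greedy ordering, and the number of priority levels are all invented here for exposition; real priority-enabled networks typically expose a small fixed number of hardware priority queues.

```python
# Hypothetical sketch of JPAS-style flow priority assignment.
# All names ("urgency", job ids, 8 priority levels) are assumptions
# made for illustration, not details taken from the paper.

def order_jobs(jobs):
    """Greedily order DDL jobs; here we sort by an assumed per-job
    'urgency' score, imagined to favor jobs still in their early stage."""
    return sorted(jobs, key=lambda j: j["urgency"], reverse=True)

def assign_flow_priorities(jobs, num_priority_levels=8):
    """Map the job order onto the priority levels exposed by a
    priority-enabled network; flows of higher-ranked jobs receive
    smaller (more urgent) priority values."""
    priorities = {}
    for rank, job in enumerate(order_jobs(jobs)):
        # Jobs beyond the number of hardware queues share the lowest level.
        priorities[job["id"]] = min(rank, num_priority_levels - 1)
    return priorities

jobs = [
    {"id": "jobA", "urgency": 0.9},
    {"id": "jobB", "urgency": 0.4},
    {"id": "jobC", "urgency": 0.7},
]
print(assign_flow_priorities(jobs))
# {'jobA': 0, 'jobC': 1, 'jobB': 2}
```

Each host would then tag its outgoing flows with the priority of the job they belong to, leaving per-flow scheduling and rate allocation to the network's priority queues.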
