Abstract

This paper presents “Proactive Congestion Notification” (PCN), a congestion-avoidance technique for distributed deep learning (DDL). DDL is widely used to scale out and accelerate deep neural network training: each worker trains a copy of the deep learning model on different training inputs and synchronizes the model gradients at the end of each iteration. However, the network communication required to synchronize model parameters is a well-known bottleneck in DDL. Our key observation is that this architecture makes each worker generate burst traffic every iteration, causing network congestion that in turn degrades the throughput of DDL traffic. Based on this observation, the key idea behind PCN is to prevent potential congestion by proactively regulating the switch queue length before DDL burst traffic arrives, so that switches are ready to absorb incoming bursts. In our evaluation, PCN improves the throughput of DDL traffic by 72% on average.
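
To make this idea concrete, the sketch below is a minimal, self-contained Python simulation of proactive queue regulation; it is an illustration under assumed parameters, not the authors' implementation. A single switch queue is shared by steady background traffic and a periodic DDL burst, and the hypothetical pcn_prepare() step throttles background senders for a short window before each expected burst so that the queue is short when the burst arrives. All names and constants are illustrative assumptions.

    # Illustrative simulation of proactive queue regulation (assumed parameters,
    # not the paper's code). Time advances in discrete ticks; units are arbitrary.
    QUEUE_CAPACITY = 80       # switch queue limit (packets)
    TARGET_START_QLEN = 10    # desired queue length when a DDL burst arrives
    ITERATION_PERIOD = 50     # ticks between DDL bursts (one training iteration)
    PREPARE_WINDOW = 13       # ticks before a burst during which senders are throttled
    BURST_SIZE = 60           # DDL packets per burst
    DRAIN_RATE = 5            # packets the switch forwards per tick
    BG_RATE = 5               # background packets arriving per tick (unthrottled)

    def pcn_prepare(tick, queue_len):
        """Hypothetical PCN decision: throttle background traffic when a DDL
        burst is imminent and the queue is above the target start length."""
        ticks_to_burst = ITERATION_PERIOD - (tick % ITERATION_PERIOD)
        return ticks_to_burst <= PREPARE_WINDOW and queue_len > TARGET_START_QLEN

    def simulate(use_pcn, ticks=500):
        queue_len, dropped = 0, 0
        for tick in range(ticks):
            bg = 0 if (use_pcn and pcn_prepare(tick, queue_len)) else BG_RATE
            arrivals = bg + (BURST_SIZE if tick % ITERATION_PERIOD == 0 else 0)
            dropped += max(0, queue_len + arrivals - QUEUE_CAPACITY)
            queue_len = min(QUEUE_CAPACITY, queue_len + arrivals)
            queue_len = max(0, queue_len - DRAIN_RATE)
        return dropped

    print("drops without PCN:", simulate(use_pcn=False))
    print("drops with PCN:   ", simulate(use_pcn=True))

Under these toy parameters the baseline drops packets at every burst once a standing queue builds up, while the proactively drained queue absorbs each burst without loss, which is the effect PCN aims for.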

Highlights

  • The “Proactive Congestion Notification” (PCN) mechanism is implemented in two parts: (1) a distributed deep learning (DDL) traffic generator that sends …

  • Start queue length is the queue length occupied by background packets before the DDL packets arrive at the switch, i.e., the switch queue status right before an iteration's burst packets arrive

  • The average throughput is the amount of network traffic transmitted per unit time, and the burst completion time is the time taken for the network transfer of a burst's packets to complete (see the sketch after this list)
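
Stated precisely, these metrics can be written down in a few lines. The helper below is an illustrative Python rendering under assumed inputs, not the paper's tooling: a trace is a list of (timestamp_s, size_bytes, is_ddl) tuples for one iteration, and queue_at() is an assumed sampler of the switch queue length at a given time.

    # Illustrative metric definitions (assumed trace format, not the paper's code).

    def average_throughput_bps(trace):
        """Average throughput: transmitted bits divided by the trace duration."""
        start, end = trace[0][0], trace[-1][0]
        total_bits = 8 * sum(size for _, size, _ in trace)
        return total_bits / (end - start)

    def burst_completion_time_s(trace):
        """Burst completion time: span from the first to the last DDL packet
        of the iteration's burst."""
        ddl_times = [t for t, _, is_ddl in trace if is_ddl]
        return ddl_times[-1] - ddl_times[0]

    def start_queue_length(trace, queue_at):
        """Start queue length: queue occupancy (background packets) sampled
        just before the iteration's first DDL packet arrives."""
        first_ddl = min(t for t, _, is_ddl in trace if is_ddl)
        return queue_at(first_ddl)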

Introduction

In DDL, network communication for parameter synchronization is known to be a major bottleneck [6,7,8]. To quantify the burstiness of DDL traffic, we measure the traffic generated by the workers. We set up our DDL environment with the parameter server (PS) architecture because it is a popular approach for parameter synchronization, and we use TensorFlow v1.6 without additional DDL optimizations (Section 2.2), because those optimizations target host-side communication, i.e., the scheduling of tensor transmissions.
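
One simple way to observe this per-iteration burstiness, sketched below as an assumption rather than the paper's actual measurement harness, is to sample a worker NIC's transmit-byte counter at fine granularity from /proc/net/dev on Linux and print per-interval throughput; the interface name and sampling interval are illustrative.

    # Illustrative burstiness probe (an assumed method, not the paper's tooling):
    # sample the worker NIC's transmit-byte counter and print per-interval
    # throughput, which should spike once per training iteration.
    import time

    def tx_bytes(iface="eth0"):
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[8])  # 9th counter = transmitted bytes
        raise ValueError(f"no such interface: {iface}")

    prev, interval = tx_bytes(), 0.001  # 1 ms sampling interval
    for _ in range(5000):               # observe ~5 s of traffic
        time.sleep(interval)
        cur = tx_bytes()
        print(f"{(cur - prev) * 8 / interval / 1e9:.3f} Gbit/s")
        prev = cur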

