Abstract

Neural network models are growing deeper and wider to obtain higher accuracy and robustness. However, the limited physical memory capacity of existing hardware devices restricts the scale of the neural networks that can be trained, and the limited computing capacity results in excessively long training times. Distributed parallelism schemes based on multi-accelerator machines have therefore become an effective way to train large-scale neural networks. Pipeline parallelism is one such scheme and offers a large advantage in training speed, but it also significantly increases peak memory usage and communication overhead because it must store multiple versions of activations. Our previous work proposed a data transfer mechanism and applied it to the PipeDream design (a mature pipeline parallelism scheme); the mechanism offloads activations in the pipeline to other memory devices, such as CPU memory. It greatly reduces the peak memory usage of PipeDream, but it introduces a large amount of communication, which costs PipeDream much of its training speed. This paper proposes an optimized pipeline parallelism scheme, PipeFB, designed for the data transfer mechanism. Unlike traditional pipeline parallelism schemes, PipeFB deploys the forward propagation and backward propagation of the neural network on different computing nodes. We implement PipeFB and apply the data transfer mechanism to it. The experimental results show that our design has the same peak memory usage as PipeDream with the data transfer mechanism, while its training speed is 1.48 to 2.27 times faster.

Keywords: Neural network training; Distributed parallelism; Pipeline parallelism; Data transfer mechanism

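The abstract's central idea is offloading forward-pass activations to CPU memory and bringing them back to the accelerator before backward propagation. The following is a minimal sketch of that pattern, assuming a PyTorch environment; the `OffloadedActivation` class and the usage below are illustrative assumptions, not the paper's actual PipeFB or PipeDream implementation.

```python
import torch


class OffloadedActivation:
    """Illustrative holder that parks a forward-pass activation in CPU memory."""

    def __init__(self, activation: torch.Tensor):
        # Copy the activation to CPU memory; non_blocking lets the transfer
        # overlap with other GPU work when the source tensor is on a GPU.
        self._cpu_copy = activation.detach().to("cpu", non_blocking=True)
        self._device = activation.device

    def restore(self) -> torch.Tensor:
        # Move the activation back to the accelerator for the backward pass.
        return self._cpu_copy.to(self._device, non_blocking=True)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(4, 1024, device=device)
    act = torch.relu(x)                 # activation produced during forward
    stored = OffloadedActivation(act)   # offload: frees accelerator memory
    del act
    restored = stored.restore()         # reload before backward propagation
    print(restored.shape, restored.device)
```

As the abstract notes, this trade lowers peak accelerator memory at the cost of extra host-device communication, which is the overhead PipeFB is designed to reduce.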