Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster

Víctor Campos, Francesc Sastre, Maurici Yagües, Míriam Bellver, Xavier Giró-I-Nieto, Jordi Torres

doi:10.1016/j.procs.2017.05.074

Open Access

https://doi.org/10.1016/j.procs.2017.05.074

Abstract

Deep learning algorithms owe their success to high-capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples, so the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. In this work, we explore how the training of a state-of-the-art neural network for computer vision can be parallelized on a distributed GPU cluster. The effect of distributing the training process is addressed from two points of view. First, the scalability of the task and its performance in the distributed setting are analyzed. Second, the impact of distributed training methods on the final accuracy of the models is studied.

Full Text