Abstract

Parallel training accelerates Deep Neural Network (DNN) training by using multiple GPUs in parallel. However, when the GPUs are distributed across different nodes, in-memory data transmission becomes cross-node network transmission, which prolongs the training time. Most research addresses this problem by reducing the amount of data sent over network links, but ignores the factor of network distance. In this paper, we construct a distributed DNN training architecture based on MapReduce. A customized scheduler is designed to place the computation nodes that perform the training closer to the nodes that store the data. At the same time, the parallel training models are synchronized by adjusting the data transmission time. The experimental results show that the shortened network distance reduces network traffic usage. The resulting reduction in data transmission time decreases the training time by at least 50% and guarantees synchronization for the parallel training.
