Abstract

To accommodate large volumes of training data and increasingly complex models, distributed deep learning training is employed more and more frequently. However, communication bottlenecks between distributed systems lead to poor performance of distributed deep learning training. In this study, we propose a new collective communication method for Python environments that utilizes Multi-Channel Dynamic Random Access Memory (MCDRAM) in Intel Xeon Phi Knights Landing processors. Major deep learning frameworks such as TensorFlow and PyTorch offer Python as their main development language, so we developed an efficient communication library by adapting the Memkind library, a C-based library for utilizing the high-performance MCDRAM memory. For performance evaluation, we tested the collective communication operations commonly used in distributed deep learning: Broadcast, Gather, and AllReduce. We conducted experiments to analyze the effect of high-performance memory and processor location on communication performance. In addition, we analyzed performance in a Docker environment, which is relevant given the recent trend toward cloud computing. Through extensive experiments on our testbed, we confirmed that the proposed method improves communication performance by up to 487%.
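The Python-side wrapping of the C-based Memkind library can be pictured with a short ctypes sketch like the one below. This is only an illustration under stated assumptions (the shared-library name and the helper functions hbw_alloc/hbw_free are ours, not the authors'); the Memkind entry points it calls, memkind_malloc and memkind_free with the exported MEMKIND_HBW kind, are the standard C API for placing allocations in MCDRAM.

```python
# Minimal sketch (not the authors' library): calling Memkind's C API from Python
# via ctypes to allocate a buffer in high-bandwidth MCDRAM (MEMKIND_HBW).
import ctypes

# Assumes libmemkind is installed; the exact .so name may differ per system.
libmemkind = ctypes.CDLL("libmemkind.so")

# memkind_t is an opaque struct pointer; MEMKIND_HBW is an exported global symbol.
MEMKIND_HBW = ctypes.c_void_p.in_dll(libmemkind, "MEMKIND_HBW")

libmemkind.memkind_malloc.restype = ctypes.c_void_p
libmemkind.memkind_malloc.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libmemkind.memkind_free.argtypes = [ctypes.c_void_p, ctypes.c_void_p]

def hbw_alloc(nbytes):
    """Allocate nbytes in MCDRAM; raise if no high-bandwidth memory is available."""
    ptr = libmemkind.memkind_malloc(MEMKIND_HBW, nbytes)
    if not ptr:
        raise MemoryError("memkind_malloc failed (is MCDRAM available?)")
    return ptr

def hbw_free(ptr):
    """Release a buffer previously obtained from hbw_alloc."""
    libmemkind.memkind_free(MEMKIND_HBW, ptr)

# Example: allocate a 4 MiB buffer in MCDRAM, then release it.
buf = hbw_alloc(4 * 1024 * 1024)
hbw_free(buf)
```

In a communication library, buffers allocated this way would back the send and receive sides of the collective operations rather than being freed immediately.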

Highlights

  • The performance of deep learning [1] has improved significantly, thanks to advanced neural network architectures, improvements in computer hardware (HW) performance, and the utilization of large-scale datasets

  • Sub-Non-Uniform Memory Access (NUMA) Clustering-4 (SNC-4) mode appears to the operating system as four separate sockets because each quadrant is exposed as its own NUMA domain (a minimal check of this is sketched after this list)

  • In Table 1, when Broadcast was performed as Message Passing Interface (MPI) collective communication with 16 processes, Multi-Channel Dynamic Random Access Memory (MCDRAM) improved performance by 445% over Double Data Rate 4 (DDR4) memory
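As a quick companion to the SNC-4 highlight above, the following sketch (our illustration of the standard Linux sysfs layout, not code from the paper) lists the NUMA nodes the operating system exposes; on a Knights Landing system booted in SNC-4 mode, each quadrant, and each MCDRAM region in flat mode, shows up as a separate node.

```python
# Illustrative sketch: list the NUMA nodes Linux exposes via sysfs.
# On a KNL system in SNC-4 / flat mode, each quadrant (and each MCDRAM
# region) appears as its own node directory under /sys/devices/system/node.
import glob
import os

def list_numa_nodes():
    nodes = []
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node_id = int(os.path.basename(path).replace("node", ""))
        with open(os.path.join(path, "meminfo")) as f:
            # First line looks like: "Node 0 MemTotal:  98706812 kB"
            total_kb = int(f.readline().split()[3])
        nodes.append((node_id, total_kb))
    return sorted(nodes)

for node_id, total_kb in list_numa_nodes():
    print(f"NUMA node {node_id}: {total_kb / 1024:.0f} MiB")
```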


Summary

Introduction

The performance of deep learning [1] has improved significantly, thanks to advanced neural network architectures, improvements in computer hardware (HW) performance, and the utilization of large-scale datasets. To train a deep learning model in the parameter server architecture, communication between the parameter server and the workers is required. TensorFlow and PyTorch, the typical frameworks for deep learning, are based on Python. For these reasons, Python must be able to use multiple cores and perform simultaneous communications to support distributed deep learning efficiently. In Python-based distributed deep learning environments, either multi-processing or multi-threading methods can be used. We propose a novel method for efficiently exchanging data when performing distributed deep learning in Python on many-core CPU environments, and we compare and analyze the proposed method through extensive experiments in various environments.
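For concreteness, the sketch below shows how the three collective operations evaluated later (Broadcast, Gather, and AllReduce) are typically expressed in Python with the widely used mpi4py binding over NumPy buffers; it is a generic illustration, not the communication library proposed in this study.

```python
# Generic mpi4py illustration (not the paper's proposed library):
# buffer-based Broadcast, Gather, and AllReduce over NumPy arrays.
# Run with, for example:  mpirun -np 16 python collectives_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1 << 20  # about one million float64 elements per process

# Broadcast: the root sends its parameter buffer to every worker.
params = np.arange(n, dtype=np.float64) if rank == 0 else np.empty(n, dtype=np.float64)
comm.Bcast(params, root=0)

# Gather: every process sends its local gradient chunk to the root.
local_grad = np.full(n, rank, dtype=np.float64)
gathered = np.empty((size, n), dtype=np.float64) if rank == 0 else None
comm.Gather(local_grad, gathered, root=0)

# AllReduce: sum gradients across all processes; every rank gets the result.
summed = np.empty(n, dtype=np.float64)
comm.Allreduce(local_grad, summed, op=MPI.SUM)

if rank == 0:
    print("Bcast, Gather, and Allreduce completed on", size, "processes")
```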

Related Work
Distributed Deep Learning
Python GIL
Python-Based MPI Collective Communication Library
Memkind Library for Using MCDRAM
Architecture
Wrapping
Implementation
Import
Binding
Non-Uniform Memory Access (NUMA)
Performance Evaluation
Broadcast
Gather
AllReduce
Summary of Experimental Results
Findings
Conclusions
