Abstract

To accommodate large volumes of training data and increasingly complex models, distributed deep learning training is employed more and more frequently. However, communication bottlenecks between distributed systems lead to poor performance of distributed deep learning training. In this study, we propose a new collective communication method for Python environments that utilizes Multi-Channel Dynamic Random Access Memory (MCDRAM) in Intel Xeon Phi Knights Landing processors. Major deep learning frameworks such as TensorFlow and PyTorch offer Python as their main development language, so we developed an efficient communication library by adapting the Memkind library, a C-based library for utilizing the high-performance MCDRAM memory. For performance evaluation, we tested the collective communication operations commonly used in distributed deep learning: Broadcast, Gather, and AllReduce. We conducted experiments to analyze the effect of high-performance memory and processor location on communication performance. In addition, we analyzed performance in a Docker environment, which is relevant given the recent trend toward cloud computing. Through extensive experiments on our testbed, we confirmed that the proposed method improves communication performance by up to 487%.
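The Python-side wrapping of the C-based Memkind library can be pictured with a short ctypes sketch like the one below. This is only an illustration under stated assumptions (the shared-library name and the helper functions hbw_alloc/hbw_free are ours, not the authors'); the Memkind entry points it calls, memkind_malloc and memkind_free with the exported MEMKIND_HBW kind, are the standard C API for placing allocations in MCDRAM.

```python
# Minimal sketch (not the authors' library): calling Memkind's C API from Python
# via ctypes to allocate a buffer in high-bandwidth MCDRAM (MEMKIND_HBW).
import ctypes

# Assumes libmemkind is installed; the exact .so name may differ per system.
libmemkind = ctypes.CDLL("libmemkind.so")

# memkind_t is an opaque struct pointer; MEMKIND_HBW is an exported global symbol.
MEMKIND_HBW = ctypes.c_void_p.in_dll(libmemkind, "MEMKIND_HBW")

libmemkind.memkind_malloc.restype = ctypes.c_void_p
libmemkind.memkind_malloc.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libmemkind.memkind_free.argtypes = [ctypes.c_void_p, ctypes.c_void_p]

def hbw_alloc(nbytes):
    """Allocate nbytes in MCDRAM; raise if no high-bandwidth memory is available."""
    ptr = libmemkind.memkind_malloc(MEMKIND_HBW, nbytes)
    if not ptr:
        raise MemoryError("memkind_malloc failed (is MCDRAM available?)")
    return ptr

def hbw_free(ptr):
    """Release a buffer previously obtained from hbw_alloc."""
    libmemkind.memkind_free(MEMKIND_HBW, ptr)

# Example: allocate a 4 MiB buffer in MCDRAM, then release it.
buf = hbw_alloc(4 * 1024 * 1024)
hbw_free(buf)
```

In a communication library, buffers allocated this way would back the send and receive sides of the collective operations rather than being freed immediately.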

Highlights

  • The performance of deep learning [1] has improved significantly, thanks to advanced neural network architectures, improvements in computer hardware (HW) performance, and the utilization of large-scale datasets

  • Sub-Non-Uniform Memory Access (NUMA) Clustering-4 (SNC-4) mode appears to the operating system as four separate sockets because each quadrant is exposed as its own NUMA domain (a minimal check of this is sketched after this list)

  • In Table 1, when Broadcast was performed as Message Passing Interface (MPI) collective communication with 16 processes, Multi-Channel Dynamic Random Access Memory (MCDRAM) improved performance by 445% over Double Data Rate 4 (DDR4) memory
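As a quick companion to the SNC-4 highlight above, the following sketch (our illustration of the standard Linux sysfs layout, not code from the paper) lists the NUMA nodes the operating system exposes; on a Knights Landing system booted in SNC-4 mode, each quadrant, and each MCDRAM region in flat mode, shows up as a separate node.

```python
# Illustrative sketch: list the NUMA nodes Linux exposes via sysfs.
# On a KNL system in SNC-4 / flat mode, each quadrant (and each MCDRAM
# region) appears as its own node directory under /sys/devices/system/node.
import glob
import os

def list_numa_nodes():
    nodes = []
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node_id = int(os.path.basename(path).replace("node", ""))
        with open(os.path.join(path, "meminfo")) as f:
            # First line looks like: "Node 0 MemTotal:  98706812 kB"
            total_kb = int(f.readline().split()[3])
        nodes.append((node_id, total_kb))
    return sorted(nodes)

for node_id, total_kb in list_numa_nodes():
    print(f"NUMA node {node_id}: {total_kb / 1024:.0f} MiB")
```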


Summary

Introduction

The performance of deep learning [1] has improved significantly, thanks to advanced neural network architectures, improvements in computer hardware (HW) performance, and the utilization of large-scale datasets. To train a deep learning model in the parameter server architecture, communication between the parameter server and the workers is required. TensorFlow and PyTorch, the typical frameworks for deep learning, are based on Python. For these reasons, Python must be able to use multiple cores and perform simultaneous communications to support distributed deep learning efficiently. In Python-based distributed deep learning environments, either multi-processing or multi-threading methods can be used. We propose a novel method for efficiently exchanging data when performing distributed deep learning in Python on many-core CPU environments, and we compare and analyze the proposed method through extensive experiments in various environments.
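For concreteness, the sketch below shows how the three collective operations evaluated later (Broadcast, Gather, and AllReduce) are typically expressed in Python with the widely used mpi4py binding over NumPy buffers; it is a generic illustration, not the communication library proposed in this study.

```python
# Generic mpi4py illustration (not the paper's proposed library):
# buffer-based Broadcast, Gather, and AllReduce over NumPy arrays.
# Run with, for example:  mpirun -np 16 python collectives_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1 << 20  # about one million float64 elements per process

# Broadcast: the root sends its parameter buffer to every worker.
params = np.arange(n, dtype=np.float64) if rank == 0 else np.empty(n, dtype=np.float64)
comm.Bcast(params, root=0)

# Gather: every process sends its local gradient chunk to the root.
local_grad = np.full(n, rank, dtype=np.float64)
gathered = np.empty((size, n), dtype=np.float64) if rank == 0 else None
comm.Gather(local_grad, gathered, root=0)

# AllReduce: sum gradients across all processes; every rank gets the result.
summed = np.empty(n, dtype=np.float64)
comm.Allreduce(local_grad, summed, op=MPI.SUM)

if rank == 0:
    print("Bcast, Gather, and Allreduce completed on", size, "processes")
```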

Related Work
Distributed Deep Learning
Python GIL
Python-Based MPI Collective Communication Library
Memkind Library for Using MCDRAM
Architecture
Wrapping
Implementation
Import
Binding
Non-Uniform Memory Access (NUMA)
Performance Evaluation
Broadcast
Gather
AllReduce
Summary of Experimental Results
Findings
Conclusions
