Abstract
Distributed deep learning systems place stringent requirements on communication bandwidth when training models on large volumes of input data under user-time constraints. Communication takes place mainly between a cluster of worker nodes that process the training data and parameter servers that maintain the global model. For fast convergence, worker nodes and parameter servers must frequently exchange billions of parameters to broadcast updates quickly and minimize staleness. The demand on bandwidth grows even higher once dedicated GPUs are introduced into the computation. While RDMA-capable networks have great potential to provide sufficiently high bandwidth, their current use over TCP/IP, or their coupling to particular programming models such as MPI, limits their ability to break the bandwidth bottleneck. In this work we propose iRDMA, an RDMA-based parameter server architecture optimized for high-performance network environments that supports both GPU- and CPU-based training. It uses native asynchronous RDMA verbs to reach network line speed while minimizing the communication processing cost on both the worker and parameter-server sides. Furthermore, iRDMA exposes the parameter server system through a POSIX-compatible file API, which simplifies load balancing and fault tolerance and makes the system easy to use. We have implemented iRDMA on IBM's deep learning platform. Experimental results show that our design helps deep learning applications, including image recognition and language classification, achieve near-linear improvement in convergence speed and training-accuracy acceleration by using distributed computing resources. From the system perspective, iRDMA can efficiently utilize about 95% of the network bandwidth of fast networks to synchronize models among distributed training processes.
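The abstract attributes the line-rate communication to native asynchronous RDMA verbs rather than TCP/IP or MPI. The sketch below is not the paper's implementation; it only illustrates the general verbs pattern the abstract refers to, using the standard libibverbs API: a worker posts a one-sided RDMA WRITE of a local gradient buffer to a parameter-server memory region and later polls the completion queue, so data movement bypasses the remote CPU and can overlap with computation. The function name `push_gradients` and all resource arguments (queue pair, completion queue, registered memory region, remote address and rkey) are assumed to have been established during connection setup and are illustrative only.

```c
/* Minimal sketch (assumed, not the paper's code): asynchronous RDMA WRITE
 * of a gradient buffer to a parameter server, using libibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

int push_gradients(struct ibv_qp *qp, struct ibv_cq *cq,
                   struct ibv_mr *mr, void *buf, uint32_t len,
                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* local, registered gradient buffer */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write       */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion  */
    wr.wr.rdma.remote_addr = remote_addr;        /* server-side buffer    */
    wr.wr.rdma.rkey        = rkey;

    /* Posting returns immediately; the NIC moves the data without
     * involving the remote CPU, so the caller can keep computing. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* For brevity we poll the completion queue right away; a real worker
     * would overlap this with the next training step. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                        /* busy-poll */
    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA write failed: status %d\n", wc.status);
        return -1;
    }
    return 0;
}
```

The asynchronous post/poll split is what lets communication processing cost stay low on both sides: the worker spends CPU cycles only to post and reap work requests, and the parameter server's CPU is not interrupted by incoming writes.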