GaDei: On Scale-Up Training as a Service for Deep Learning

Wei Zhang, Bowen Zhou, Yufei Ren, Yunhui Zheng, Ji Liu, Bing Xiang, Peng Liu, Fei Wang, Li Zhang, Minwei Feng, Yandong Wang

doi:10.1109/icdm.2017.161

Abstract

Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. TaaS must satisfy a wide range of customers who lack the experience or resources to tune DL hyper-parameters (e.g., mini-batch size and learning rate), and meticulous tuning for each user's dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed at values that are applicable to all users. Unfortunately, few research papers have studied how to design a system for TaaS workloads. By evaluating the IBM Watson Natural Language Classifier (NLC) workloads, the most popular IBM cognitive service, used by thousands of enterprise-level clients globally, we provide empirical evidence that only a conservative hyper-parameter setup (e.g., small mini-batch size) can guarantee acceptable model accuracy for a wide range of customers. However, a smaller mini-batch size requires higher communication bandwidth in a parameter-server based DL training system. In this paper, we characterize the exceedingly high communication bandwidth requirement of TaaS using representative industrial deep learning workloads. We then present GaDei, a highly optimized shared-memory based scale-up parameter server design. We evaluate GaDei using both commercial and public benchmarks and demonstrate that GaDei significantly outperforms the state-of-the-art parameter-server based implementation while maintaining the required accuracy. GaDei achieves near-best-possible runtime performance, constrained only by hardware limitations. Furthermore, to the best of our knowledge, GaDei is the only scale-up DL system that provides fault tolerance.
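
The link between mini-batch size and communication bandwidth can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper: it assumes a simple parameter-server model in which every learner pushes one full gradient and pulls one full copy of the model per mini-batch, and the function name, parameters, and numbers are all hypothetical.

```python
def required_bandwidth_bytes_per_s(model_bytes, samples_per_s_per_learner,
                                   mini_batch_size, num_learners):
    """Estimate aggregate parameter-server traffic in bytes per second.

    Assumption (not from the paper): each learner pushes a full gradient
    and pulls the full model once per mini-batch, so traffic per update
    is 2 * model_bytes.
    """
    # Each learner finishes this many mini-batches (updates) per second.
    updates_per_s = samples_per_s_per_learner / mini_batch_size
    return 2 * model_bytes * updates_per_s * num_learners


# Hypothetical workload: 10 MB model, 1000 samples/s per learner, 4 learners.
large_batch = required_bandwidth_bytes_per_s(10_000_000, 1000, 128, 4)
small_batch = required_bandwidth_bytes_per_s(10_000_000, 1000, 2, 4)

# With throughput held fixed, traffic scales inversely with mini-batch size.
ratio = small_batch / large_batch
print(f"mini-batch 128: {large_batch / 1e9:.3f} GB/s")
print(f"mini-batch 2:   {small_batch / 1e9:.3f} GB/s")
print(f"ratio: {ratio:.0f}x")
```

Under these assumptions, shrinking the mini-batch from 128 to 2 at constant sample throughput multiplies the required bandwidth by 64, which is the kind of pressure that motivates a shared-memory, scale-up design.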
