Abstract

Running deep learning models on a single computer is often resource-intensive and time-consuming. Deep learning models require high-performance GPUs to train on big data, and even with such GPUs, training on large datasets can take days or even months. This paper presents an affordable solution for executing these models within a reasonable time. We propose a system well suited to distributing large-scale deep learning models across commodity hardware. Our approach builds distributed computing clusters from open-source software alone, providing performance comparable to High-Performance Computing clusters even in the absence of GPUs. We create Hadoop clusters by connecting servers over SSH, which interconnects the machines and enables continuous data transfer between them. We then set up Apache Spark on the Hadoop cluster and run BigDL on top of Spark. BigDL is a high-performance deep learning library for Spark that lets us scale to massive datasets, run large deep learning models locally from a Jupyter Notebook, and simplifies cluster computing and resource management. This environment delivers computation up to 70% faster than single-machine execution, with the option to scale for model training, data throughput, hyperparameter search, and resource utilization.
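
The abstract describes a three-layer stack: a Hadoop cluster linked over SSH, Apache Spark on top of it, and BigDL driving deep learning from a Jupyter Notebook. The sketch below illustrates how such a setup is typically initialized from Python; it is a minimal example assuming the classic BigDL Python API (`create_spark_conf`, `init_engine`) and a hypothetical `spark://master:7077` cluster URL, not the authors' exact configuration.

```python
# Minimal sketch: initializing BigDL on an existing Spark/Hadoop cluster.
# The master URL and the small model below are illustrative placeholders,
# not the configuration used in the paper.
from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine
from bigdl.nn.layer import Sequential, Linear, ReLU

# Create a SparkContext with BigDL-specific configuration and point it at
# the cluster master (use "local[*]" for single-machine testing).
conf = (create_spark_conf()
        .setMaster("spark://master:7077")
        .setAppName("bigdl-commodity-cluster-demo"))
sc = SparkContext.getOrCreate(conf=conf)
init_engine()  # prepare BigDL's execution engine on the executors

# A small feed-forward network; in practice this is where a large deep
# learning model would be defined or loaded, then trained with BigDL's
# Optimizer over an RDD of samples read from HDFS.
model = (Sequential()
         .add(Linear(784, 128))
         .add(ReLU())
         .add(Linear(128, 10)))
```

From this point, training is driven through BigDL's distributed `Optimizer` over Spark RDDs, which is what lets the same notebook code scale from a laptop to the full commodity cluster.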
