Abstract

With the increasing number of artificial intelligence (AI) applications, complex neural-network-based modeling on large-scale data is becoming quite common. Using distributed deep learning (DDL), a data scientist can harness the computational power of multiple graphics processing units (GPUs) spread across different machines and significantly reduce neural network training times. In this paper, we focus on a TensorFlow-based distributed training approach, which, unfortunately, leaves the auxiliary modeling capabilities (e.g., resource deployment, data distribution, and run-time control) to user discretion. Additionally, these training capabilities cannot be accessed directly through popular high-level interfaces such as Keras. As a result, the transition to DDL from a single-node training setting involves a steep learning curve for data scientists. To bridge this gap, we present EasyDist, an end-to-end DDL tool that preserves the single-node programming model by layering distributed TensorFlow between a Keras interface and public cloud infrastructure. EasyDist is model agnostic, i.e., any neural network model written in Keras can use EasyDist for distributed training. Internally, EasyDist incorporates resource deployment, data allocation, training orchestration, and result aggregation within a clean abstraction, which allows the data scientist to focus on model definition. Evaluation of EasyDist on publicly available benchmark datasets and models shows that model accuracy is not compromised in any significant way, while training times can be reduced by up to ~6-8x compared to single-machine settings.
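The abstract does not show EasyDist's actual API, but as a rough sketch of the distributed-TensorFlow machinery that such a tool hides behind the single-node Keras workflow, the following example uses TensorFlow's standard multi-worker mirrored strategy. The cluster hosts, model architecture, and dataset here are illustrative placeholders, not details taken from the paper.

```python
# Hypothetical sketch (not EasyDist's API): multi-worker Keras training with
# distributed TensorFlow, the layer a tool like EasyDist would manage for the user.
import json
import os

import tensorflow as tf

# Each worker learns its cluster role from TF_CONFIG; an EasyDist-style tool
# would generate and distribute this configuration automatically.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},  # placeholder hosts
    "task": {"type": "worker", "index": 0},
}))

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# The model itself is plain Keras, unchanged from a single-node script.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Input sharding across workers is handled by tf.data and the strategy.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64)

model.fit(dataset, epochs=5)
```

In this sketch, the model definition and `fit` call are identical to a single-node Keras script; only the cluster configuration and the strategy scope differ, which is the portion a tool like EasyDist would automate.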
