Improving the Performance of Distributed TensorFlow with RDMA

Chengfan Jia,Mengxian Chi,Wenting Han,Zheng Wu,Hong An,Xu Jin,Junnan Liu,Han Lin

doi:10.1007/s10766-017-0520-3

Abstract

TensorFlow is an open-source software library designed for Deep Learning using dataflow graph computation. Thanks to the flexible architecture of TensorFlow, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. In a distributed TensorFlow work process, it uses gRPC to connect between different nodes. However, when deploying training tasks on high performance computing clusters, the performance of gRPC becomes a bottleneck of distributed TensorFlow system. HPC clusters are usually equipped with Infiniband network, in addition to traditional TCP/IP network. But open-sourced TensorFlow has not taken this advantage. We present a RDMA-capable design of TensorFlow. By porting the Tensor send/receive parts of TensorFlow into RDMA verbs, we finally get nearly 6 $$\times $$ performance improvements over the original distributed TensorFlow, based on gRPC. The TensorFlow system with RDMA support shows a great scalability among the training scale.

Full Text