Abstract

Deep learning (DL) inference is widely used and delivers excellent performance in many intelligent applications. Unfortunately, the high resource consumption and training effort of sophisticated models put them out of reach for ordinary users. Deep Learning Inference as a Service (DIaaS), which offers online inference services in the cloud, has therefore become popular among cloud tenants, who send their inference inputs via RPCs across the cloud's internal network. However, this detached architecture paradigm is a poor fit for DIaaS: the high-dimensional inputs consume a large share of scarce internal bandwidth, while the service latency must remain low and stable. We therefore propose a novel cloud architecture paradigm for DIaaS that addresses both problems without sacrificing the security and maintenance benefits of the service model. We first leverage SGX, a strongly protected user-space enclave technology, to bring DIaaS computation as close as possible to its input source, i.e., co-locating a cloud tenant and its subscribed DIaaS in the same virtual machine. When GPU acceleration is needed, we migrate this virtual machine to an available GPU host and transparently use the GPU through our backend computing stack installed on that host. In this way, most of the internal bandwidth consumed by the traditional paradigm is saved. Furthermore, we substantially improve the efficiency of the proposed paradigm, from both the computation and the I/O perspectives, by making the entire data flow more DL-oriented. Finally, we implement a prototype system and evaluate it in real-world scenarios. The experiments show that our locality-aware architecture achieves an average single-CPU (single-GPU) deep learning inference time 2.84X (4.87X) lower than the traditional detached architecture.
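
To make the bandwidth argument concrete, the minimal sketch below contrasts the two data paths at a toy level. It is illustrative only and not the paper's system: the dummy model, the 10 Gbps bandwidth figure, and all function names are assumptions, and real RPC overhead (serialization, enclave transitions, VM migration) is ignored.

```python
# Toy comparison of the detached vs. co-located DIaaS data paths.
# All names and numbers are illustrative assumptions, not details from the paper.
import time
import numpy as np

def dummy_model(x: np.ndarray) -> np.ndarray:
    """Stand-in for a DL inference step (assumption: any callable model)."""
    return x.mean(axis=-1)

def detached_inference(x: np.ndarray, bandwidth_gbps: float = 10.0) -> np.ndarray:
    """Detached paradigm: the high-dimensional input crosses the internal
    network to a remote DIaaS host before inference runs."""
    transfer_s = x.nbytes * 8 / (bandwidth_gbps * 1e9)  # idealized transfer time only
    time.sleep(transfer_s)                               # emulate the RPC data movement
    return dummy_model(x)

def colocated_inference(x: np.ndarray) -> np.ndarray:
    """Proposed paradigm: inference runs next to the input source (e.g. inside
    an enclave in the tenant's VM), so no internal-network hop is needed."""
    return dummy_model(x)

if __name__ == "__main__":
    batch = np.random.rand(64, 3, 224, 224).astype(np.float32)  # ~38 MB of input
    for name, fn in [("detached", detached_inference), ("co-located", colocated_inference)]:
        start = time.perf_counter()
        fn(batch)
        print(f"{name}: {time.perf_counter() - start:.4f} s")
```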
