Abstract

Cloud service providers are deploying Transformer-based deep learning models on GPU servers to support many online inference-as-a-service (IAAS) applications, given the dominant performance of Transformers in natural language processing (NLP) tasks. However, the inherently high computational complexity and large model sizes of Transformers (e.g., billions to hundreds of billions of parameters) strain resource-constrained GPU servers. Improving the energy efficiency and serving capacity of IAAS without violating the service-level agreement (SLA) has become a practical challenge for service providers. This work conducts a comprehensive study of the inference performance and energy efficiency of Transformer models. First, we empirically characterize essential performance metrics, including latency, throughput, and energy consumption, on NVIDIA GPUs under various workload configurations. Second, we establish a performance and energy consumption model for Transformer inference that facilitates energy-efficient scheduling policies. Finally, we propose an online batch inference scheduling scheme for Transformers on GPU servers, which we refer to as the Mixed Aligned Scheduling (MAS) scheme. Compared with existing scheduling schemes, MAS improves throughput and energy efficiency by up to 61.56% and 69.79%, respectively, on V100 GPU servers. Our findings characterize Transformer inference on GPU servers across a wide range of input shapes and degrees of workload imbalance. We show that combining online batch inference with robust scheduling schemes can improve energy efficiency and overall inference performance under latency constraints.
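
To make the characterization step concrete, the sketch below shows one way per-batch latency, throughput, and energy could be measured for Transformer inference on an NVIDIA GPU. It is a minimal illustration, not the paper's measurement harness: the model choice (`bert-base-uncased`), the batch-size sweep, and the helper name `measure_batch` are assumptions introduced here, and the energy figure is a coarse estimate from two NVML power samples.

```python
# Hedged sketch: per-batch latency/throughput/energy measurement for
# Transformer inference on one NVIDIA GPU. Assumes PyTorch, HuggingFace
# Transformers, and pynvml are installed; model and batch sizes are
# illustrative choices, not taken from the paper.
import time
import torch
import pynvml
from transformers import AutoModel, AutoTokenizer

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

@torch.no_grad()
def measure_batch(sentences):
    """Return (latency_s, throughput_qps, approx_energy_J) for one batch."""
    inputs = tokenizer(sentences, padding=True, return_tensors="pt").to(device)
    torch.cuda.synchronize()
    p0 = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    t0 = time.perf_counter()
    model(**inputs)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    p1 = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    latency = t1 - t0
    energy = (p0 + p1) / 2.0 * latency  # coarse average-power estimate
    return latency, len(sentences) / latency, energy

# Example sweep: observe how latency, throughput, and energy scale with batch size.
for bs in (1, 8, 32):
    print(bs, measure_batch(["an example query"] * bs))
```

In practice, a study like the one described above would repeat such measurements across sequence lengths, batch sizes, and padding (workload-imbalance) patterns, and average over many batches rather than relying on two instantaneous power readings.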
