Abstract

Cloud service providers are deploying Transformer-based deep learning models on GPU servers to support many online inference-as-a-service (IAAS) applications, given the dominant performance of Transformers in natural language processing (NLP) tasks. However, the inherently high computational complexity and large model sizes of Transformers (e.g., billions to hundreds of billions of parameters) strain resource-constrained GPU servers. Improving the energy efficiency and serving capacity of IAAS without violating the service-level agreement (SLA) has become a practical challenge for service providers. This work conducts a comprehensive study of the inference performance and energy efficiency of Transformer models. First, we empirically characterize essential performance metrics, including latency, throughput, and energy consumption, on NVIDIA GPUs under various workload configurations. Second, we establish a performance and energy consumption model for Transformer inference that facilitates energy-efficient scheduling policies. Finally, we propose an online batch inference scheduling scheme for Transformers on GPU servers, which we refer to as the Mixed Aligned Scheduling (MAS) scheme. Compared with existing scheduling schemes, MAS improves throughput and energy efficiency by up to 61.56% and 69.79%, respectively, on V100 GPU servers. Our findings characterize Transformer inference on GPU servers across a wide range of input shapes and degrees of workload imbalance. We show that combining online batch inference with robust scheduling schemes can improve energy efficiency and overall inference performance under latency constraints.
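
To make the characterization step concrete, the sketch below shows one way per-batch latency, throughput, and energy could be measured for Transformer inference on an NVIDIA GPU. It is a minimal illustration, not the paper's measurement harness: the model choice (`bert-base-uncased`), the batch-size sweep, and the helper name `measure_batch` are assumptions introduced here, and the energy figure is a coarse estimate from two NVML power samples.

```python
# Hedged sketch: per-batch latency/throughput/energy measurement for
# Transformer inference on one NVIDIA GPU. Assumes PyTorch, HuggingFace
# Transformers, and pynvml are installed; model and batch sizes are
# illustrative choices, not taken from the paper.
import time
import torch
import pynvml
from transformers import AutoModel, AutoTokenizer

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

device = torch.device("cuda")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

@torch.no_grad()
def measure_batch(sentences):
    """Return (latency_s, throughput_qps, approx_energy_J) for one batch."""
    inputs = tokenizer(sentences, padding=True, return_tensors="pt").to(device)
    torch.cuda.synchronize()
    p0 = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    t0 = time.perf_counter()
    model(**inputs)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    p1 = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    latency = t1 - t0
    energy = (p0 + p1) / 2.0 * latency  # coarse average-power estimate
    return latency, len(sentences) / latency, energy

# Example sweep: observe how latency, throughput, and energy scale with batch size.
for bs in (1, 8, 32):
    print(bs, measure_batch(["an example query"] * bs))
```

In practice, a study like the one described above would repeat such measurements across sequence lengths, batch sizes, and padding (workload-imbalance) patterns, and average over many batches rather than relying on two instantaneous power readings.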
