Abstract

Inference-as-a-service (IaaS) has recently been launched by cloud service providers to support on-demand AI applications. Many natural language processing (NLP) services are based on the Transformer sequence transduction model. However, Transformer inference consumes a significant amount of energy due to the large model size (e.g., billions of parameters) and heavy computation. Reducing the energy consumption of IaaS without violating the service-level agreement (SLA) is therefore a practical challenge for service providers. In this work, we conduct a comprehensive study of the inference performance and energy efficiency of a Transformer model trained for a language translation service. First, we empirically characterize essential performance metrics, including latency, throughput, and energy consumption, on three different GPUs under diverse workload configurations. This detailed workload separation enables a thorough understanding of the Transformer inference process. Second, we build an energy consumption model for the Transformer based on the observed data. Finally, we propose the Aligned scheduling scheme, which improves throughput and energy efficiency by up to 2.86× and 2.73×, respectively, at the cost of a 40% increase in average latency. Our findings cover the full scope of Transformer inference and suggest that workload balancing and scheduling have great potential to enable energy-efficient Transformer inference services.

