Abstract

The success of convolutional neural networks (CNNs) has made low-latency inference serving on Graphics Processing Units (GPUs) a hot research topic. However, GPUs are power-hungry processors. Batching and dynamic voltage and frequency scaling (DVFS) are two important techniques for minimizing energy consumption while meeting a latency Service-Level Objective (SLO). Existing studies, however, do not coordinate the two and treat CNNs as black boxes, which makes inference services less energy-efficient. In this paper, we propose EALI, an energy-aware layer-level adaptive scheduling framework comprising a power prediction model, a layer combination strategy, and an energy-aware layer-level scheduler. The power prediction model uses classic machine learning techniques to predict fine-grained, layer-level power consumption. The layer combination strategy merges multiple layers into optimization units to reduce scheduling overhead and complexity. The energy-aware layer-level scheduler adaptively coordinates the batching strategy and layer-level DVFS according to the workload to minimize energy consumption while meeting the SLO. Our experimental results on NVIDIA Tesla M40 and V100 GPUs show that, compared to state-of-the-art approaches, EALI reduces energy consumption by up to 36.24% while meeting the SLO.
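
To make the scheduling idea concrete, the following is a minimal Python sketch, not EALI's actual implementation: for an incoming batch, it exhaustively searches candidate (batch size, GPU frequency) pairs and picks the one with the lowest predicted energy per request among those whose predicted latency stays within the SLO. The latency and power models, the SLO value, and the frequency steps below are all illustrative assumptions standing in for the paper's learned layer-level predictors.

    # Illustrative sketch of SLO-constrained energy-minimizing scheduling.
    # All models and constants are assumptions, not EALI's real predictors.

    SLO_MS = 50.0                         # assumed latency SLO (milliseconds)
    BATCH_SIZES = [1, 2, 4, 8, 16]        # hypothetical candidate batch sizes
    FREQS_MHZ = [544, 748, 949, 1114]     # assumed DVFS frequency steps

    def predict_latency_ms(batch, freq_mhz):
        # Toy latency model: grows with batch size, shrinks with frequency
        # (stand-in for a learned per-layer latency predictor).
        return 4.0 * batch * (1114.0 / freq_mhz)

    def predict_power_w(batch, freq_mhz):
        # Toy power model: grows superlinearly with frequency
        # (stand-in for the layer-level power prediction model).
        return 60.0 + 0.08 * batch * (freq_mhz / 100.0) ** 2

    def choose_config():
        # Pick the SLO-feasible config with the least energy per request.
        best = None
        for b in BATCH_SIZES:
            for f in FREQS_MHZ:
                lat_ms = predict_latency_ms(b, f)
                if lat_ms > SLO_MS:
                    continue  # would violate the latency SLO
                energy_j = predict_power_w(b, f) * (lat_ms / 1000.0) / b
                if best is None or energy_j < best[0]:
                    best = (energy_j, b, f)
        return best

    if __name__ == "__main__":
        energy, batch, freq = choose_config()
        print(f"batch={batch}, freq={freq} MHz, ~{energy:.3f} J/request")

The real framework additionally applies DVFS per optimization unit (group of combined layers) rather than one frequency per batch, which is what makes the layer-level power model necessary.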
