Coordinated Batching and DVFS for DNN Inference on GPU Accelerators

Seyed Morteza Nabavinejad,Masoumeh Ebrahimi,Sherief Reda

doi:10.1109/tpds.2022.3144614

Abstract

Employing hardware accelerators to improve the performance and energy-efficiency of DNN applications is on the rise. One challenge of using hardware accelerators, including the GPU-based ones, is that their performance is limited by internal and external factors, such as power caps. A common approach to meet the power cap constraint is using the Dynamic Voltage Frequency Scaling (DVFS) technique. However, the functionally of this technique is limited and platform-dependent. To tackle this challenge, we propose a new control knob, which is the size of input batches fed to the GPU accelerator in DNN inference applications. We first evaluate the impact of batch size on power consumption and performance of DNN inference. Then, we introduce the design and implementation of a fast and lightweight runtime system, called BatchDVFS. Dynamic batching is implemented in <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">BatchDVFS</i> to adaptively change the batch size, and hence, trade-off throughput with power consumption. It employs an approach based on binary search to find the proper batch size within a short period of time. Combining dynamic batching with the DVFS technique, <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">BatchDVFS</i> can control the power consumption in wider ranges, and hence, yield higher throughput in the presence of power caps. To find near-optimal solution for long-running jobs that can afford a relatively significant profiling overhead, compared with <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">BatchDVFS</i> overhead, we also design an approach, called BOBD, that employs Bayesian Optimization to wisely explore the vast state space resulted by combination of the batch size and DVFS solutions. Conducting several experiments using a modern GPU and several DNN models and input datasets, we show that our <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">BatchDVFS</i> can significantly surpass the techniques solely based on DVFS or batching, regarding throughput (up to 11.2x and 2.2x, respectively), while successfully meeting the power cap.

Full Text