Abstract

Nowadays, various AI applications based on Convolutional Neural Networks (CNNs) are widely deployed on GPU-accelerated devices. However, due to the lack of visibility into GPU internal scheduling, it is challenging to accurately model the performance of CNN inference tasks or to estimate the latency of CNN tasks that are executing or waiting on the GPU. This hinders multi-model scheduling across multiple devices and real-time CNN inference. Therefore, in this paper, we propose a time estimation method that predicts the forward execution time of a convolutional layer of arbitrary shape on a GPU. The proposed method divides an explicit General Matrix Multiplication (GEMM) convolution operation into a series of individually estimable GPU sub-operations and constructs performance models at the level of these sub-operations rather than at the level of hardware instructions or entire models. Moreover, the proposed method can be easily adapted to different hardware devices or underlying algorithm implementations, since it focuses on how execution time varies with the input data scale rather than on specific instructions or hardware actions. In experiments on four typical CUDA-compatible platforms, the proposed method achieves an average error rate of less than 5% for convolutional layers in several practical CNN models, and an error rate of about 8% when estimating the GEMM convolution implementations provided by the cuDNN library. The experiments show that the proposed method can predict the forward execution time of convolutional layers of arbitrary size in CNN inference tasks on different GPU models.
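The following is a minimal sketch (not the authors' code) of the idea summarized above: treat an explicit GEMM convolution as a sequence of sub-operations, describe each sub-operation by a data-scale feature derived from the layer shape, and estimate the layer's forward time by summing per-sub-operation latency models. The sub-operation split (im2col, GEMM, bias add) and the linear latency models with placeholder coefficients are illustrative assumptions; in the paper's setting such models would be fitted from profiling measurements on each target GPU.

```python
# Sketch: estimate conv forward time by summing sub-operation latency models.
# All model coefficients below are hypothetical placeholders, not real
# calibration data for any GPU.

from dataclasses import dataclass


@dataclass
class ConvShape:
    n: int        # batch size
    c_in: int     # input channels
    h: int        # input height
    w: int        # input width
    c_out: int    # output channels
    k: int        # square kernel size
    stride: int = 1
    pad: int = 0

    def out_hw(self):
        h_out = (self.h + 2 * self.pad - self.k) // self.stride + 1
        w_out = (self.w + 2 * self.pad - self.k) // self.stride + 1
        return h_out, w_out


def sub_operation_work(shape: ConvShape) -> dict:
    """Data-scale features for each sub-operation of an explicit GEMM conv."""
    h_out, w_out = shape.out_hw()
    m = shape.c_out                      # GEMM rows
    k_dim = shape.c_in * shape.k ** 2    # GEMM inner dimension
    n_cols = shape.n * h_out * w_out     # GEMM columns
    return {
        "im2col": k_dim * n_cols,        # elements written by the unfold step
        "gemm": m * k_dim * n_cols,      # multiply-accumulate count
        "bias_add": m * n_cols,          # output elements touched
    }


# Hypothetical per-sub-operation models: latency_us = a * work + b,
# standing in for coefficients regressed from profiling runs on one GPU.
MODEL = {
    "im2col":   (2.0e-6, 5.0),
    "gemm":     (1.5e-7, 10.0),
    "bias_add": (1.0e-6, 2.0),
}


def estimate_forward_us(shape: ConvShape) -> float:
    """Sum the estimated latency of every sub-operation."""
    work = sub_operation_work(shape)
    return sum(a * work[name] + b for name, (a, b) in MODEL.items())


if __name__ == "__main__":
    layer = ConvShape(n=1, c_in=64, h=56, w=56, c_out=64, k=3, stride=1, pad=1)
    print(f"estimated forward time: {estimate_forward_us(layer):.1f} us")
```

Because the estimate depends only on shape-derived work features, adapting it to another GPU or another GEMM implementation amounts to re-fitting the per-sub-operation coefficients, which is the portability property the abstract emphasizes.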
