Abstract
Many deep learning applications are intended to run on mobile devices, and for many of them both accuracy and inference time matter. While the number of FLOPs is commonly used as a proxy for neural network latency, it is often a poor one. To obtain a better approximation of latency on a mobile CPU, the research community builds lookup tables of all possible layers and sums their measured latencies; this requires only a small number of experiments. Unfortunately, on a mobile GPU this method is not applicable in a straightforward way and shows low precision. In this work, we treat latency approximation on a mobile GPU as a data- and hardware-specific problem. Our main goal is to construct a convenient Latency Estimation Tool for Investigation (LETI) of neural network inference and to build robust and accurate latency prediction models for each specific task. To achieve this goal, we develop tools that provide a convenient way to conduct massive experiments on different target devices, focusing on mobile GPUs. Once the dataset is collected, one can train a regression model on the experimental data and use it for future latency prediction and analysis. We experimentally demonstrate the applicability of this approach on a subset of the popular NAS-Bench-101 dataset for two different mobile GPUs.
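The contrast between the two strategies can be made concrete with a small sketch. The Python code below is illustrative only: the layer names, per-layer timings, architecture encoding, and the synthetic "measured" latencies are hypothetical stand-ins for real on-device measurements, and the choice of regressor is an assumption rather than the paper's exact model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# (1) Lookup-table proxy: sum pre-measured per-layer latencies.
# Hypothetical per-layer timings (ms); a real table would be built by
# benchmarking each layer configuration once on the target CPU.
layer_latency_ms = {"conv3x3": 1.8, "conv1x1": 0.6, "maxpool": 0.3}

def lookup_estimate(layers):
    """Additive estimate: assumes per-layer latencies simply sum up,
    which tends to hold on mobile CPUs but not on mobile GPUs."""
    return sum(layer_latency_ms[l] for l in layers)

# (2) Data-driven estimate: encode whole architectures as feature
# vectors, pair them with end-to-end latencies measured on the target
# device, and fit a regression model on the collected dataset.
def encode(layers):
    # Toy encoding: counts of each layer type.
    return [layers.count(t) for t in layer_latency_ms]

archs = [list(rng.choice(list(layer_latency_ms), size=8)) for _ in range(200)]
X = np.array([encode(a) for a in archs])
# Synthetic "measured" latencies: the additive part distorted by a
# device-specific nonlinearity, standing in for real GPU measurements.
y = np.array([lookup_estimate(a) for a in archs]) * (1.0 + 0.2 * rng.random(200))

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print("lookup estimate (ms):    ", lookup_estimate(archs[0]))
print("regression estimate (ms):", model.predict(X[:1])[0])
```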
Highlights
Algorithms based on convolutional neural networks achieve high performance in numerous computer vision tasks, such as image recognition [1,2], object detection, and segmentation [3], as well as in many other areas [4]
We present the construction and analysis of a latency dataset built for two mobile devices over a subset of the NAS-Bench-101 neural architecture search (NAS) space (see the sketch after this list)
We focus on a mobile GPU
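As a hedged illustration of what sampling from that search space looks like, here is a sketch using the public nasbench package; the record-file path is a placeholder, and the matrix/ops example follows the package's documentation. Note that NAS-Bench-101 records accuracy rather than latency: building the per-device latency dataset (converting each cell to a mobile model and timing it on the GPU) is the part a tool such as LETI automates.

```python
from nasbench import api  # public NAS-Bench-101 package

# Placeholder path to the downloaded NAS-Bench-101 record file.
nasbench = api.NASBench("nasbench_only108.tfrecord")

# A cell is a DAG over at most 7 nodes: an upper-triangular adjacency
# matrix plus one operation label per node (example from the package docs).
spec = api.ModelSpec(
    matrix=[[0, 1, 1, 1, 0, 1, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 1, 0, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 0]],
    ops=["input", "conv1x1-bn-relu", "conv3x3-bn-relu", "conv3x3-bn-relu",
         "conv3x3-bn-relu", "maxpool3x3", "output"])

if nasbench.is_valid(spec):
    data = nasbench.query(spec)  # accuracy/params recorded in the benchmark
    print(data["test_accuracy"], data["trainable_parameters"])
```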
Summary
Algorithms based on convolutional neural networks achieve high performance in numerous computer vision tasks, such as image recognition [1,2], object detection, segmentation [3], and many other areas [4]. Many applications require computer vision problems to be solved in real time on end devices such as mobile phones, embedded devices, car computers, etc. Each of those devices has its own architecture, hardware, and software. For example, the actual speedup achieved by the fast and accurate ShuffleNet [5] on a Qualcomm Snapdragon 820 processor is more than 1.5× smaller than the theoretical one in comparison with MobileNet [6]. This is quite a widespread phenomenon; more examples can be found in the TensorFlow [7] Lite (TFLite) benchmark comparison [8]. Further results of TensorFlow Lite performance benchmarks for well-known models on some Android and iOS devices can be found at https://www.tensorflow.org/lite/performance/benchmarks (accessed on 19 August 2021)
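The gap between theoretical and measured speedups can be checked directly by timing a converted model with the TFLite Python interpreter. A minimal sketch follows; the model path is a placeholder, the run counts are arbitrary, and on a phone one would instead use the TFLite benchmark tool or load a GPU delegate, which is platform-specific and not shown here.

```python
import time
import numpy as np
import tensorflow as tf

# "model.tflite" is a placeholder for any converted model.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Random input of the right shape/dtype suffices for latency measurement.
x = np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"])

# Warm-up: first invocations often include one-off allocation costs.
for _ in range(5):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()

# Timed runs; report the median to damp scheduler noise.
times_ms = []
for _ in range(50):
    interpreter.set_tensor(inp["index"], x)
    start = time.perf_counter()
    interpreter.invoke()
    times_ms.append((time.perf_counter() - start) * 1000.0)

print(f"median latency: {np.median(times_ms):.2f} ms")
```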