Abstract

With new DNN accelerator hardware, the computing power available for AI applications has increased rapidly. However, as DNN algorithms become more complex and more optimized for specific applications, latency requirements remain challenging, and it is critical to find the optimal points in the design space. To decouple the architectural search from the target hardware, we propose a time estimation framework that models the inference latency of DNNs on hardware accelerators based on mapping models and layer-wise estimation models. The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation. For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models against the roofline model and a refined roofline model. We test the mixed models on the ZCU102 SoC board with DNNDK and on the Intel Neural Compute Stick 2 (NCS2) with a set of 12 state-of-the-art neural networks. The mixed models show an average estimation error of 3.47% for the DNNDK and 7.44% for the NCS2, outperforming the statistical and analytical layer models for almost all selected networks. For a randomly selected subset of 34 networks from the NASBench dataset, the mixed model reaches a fidelity of 0.988 measured by Spearman's rank correlation coefficient. The code of ANNETTE is publicly available at https://github.com/embedded-machine-learning/annette.
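The layer-wise, roofline-based estimation baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not ANNETTE's implementation: the layer shapes, peak compute rate, and memory bandwidth below are hypothetical placeholders, and ANNETTE additionally refines such per-layer estimates with benchmark-derived mapping and mixed models.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    macs: int        # multiply-accumulate operations in the layer
    mem_bytes: int   # bytes moved (weights + input/output activations)

def roofline_latency(layer: Layer, peak_macs_per_s: float, bw_bytes_per_s: float) -> float:
    """Roofline-style lower bound: the layer is either compute- or memory-bound."""
    t_compute = layer.macs / peak_macs_per_s
    t_memory = layer.mem_bytes / bw_bytes_per_s
    return max(t_compute, t_memory)

def estimate_network(layers, peak_macs_per_s, bw_bytes_per_s):
    """Layer-wise network estimate: sum the per-layer roofline latencies."""
    return sum(roofline_latency(l, peak_macs_per_s, bw_bytes_per_s) for l in layers)

# Hypothetical accelerator: 1 TMAC/s peak compute, 10 GB/s memory bandwidth.
layers = [
    Layer("conv1", macs=118_013_952, mem_bytes=1_254_400),
    Layer("fc",    macs=4_096_000,   mem_bytes=8_192_000),
]
print(f"Estimated latency: {estimate_network(layers, 1e12, 1e10) * 1e3:.3f} ms")
```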

Highlights

  • Deep Neural Networks have become key components in many Artificial Intelligence (AI) applications, including autonomous driving [1], medical diagnosis [2], [3], and machine translation [4].

  • Attempting to close the gap between the computational intensity of Deep Neural Networks (DNNs) and the available computing power, a wide variety of hardware accelerators for DNNs and other AI workloads have emerged in recent years.

  • Experimental setup: all experiments were performed with batch size 1 to achieve the lowest possible latency, but by adding the batch size as an additional input parameter of the benchmark dataset and as an additional element of the estimation models' input feature vector, the method could be extended to larger batch sizes (see the sketch after this list).
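A minimal sketch of what such an extended feature vector could look like, assuming a simple dictionary-based layer description; the feature names are illustrative and do not correspond to ANNETTE's actual feature definitions.

```python
def layer_features(layer: dict, batch_size: int = 1) -> list:
    """Build an input feature vector for a per-layer latency estimation model."""
    return [
        batch_size,               # proposed additional feature for larger batches
        layer["height"], layer["width"], layer["channels"],
        layer["filters"], layer["kernel_size"], layer["stride"],
    ]

conv = {"height": 56, "width": 56, "channels": 64,
        "filters": 128, "kernel_size": 3, "stride": 1}
print(layer_features(conv, batch_size=4))
```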

Summary

INTRODUCTION

Deep Neural Networks have become key components in many AI applications, including autonomous driving [1], medical diagnosis [2], [3], and machine translation [4]. Computational efficiency depends largely on the specific architectural parameters of each layer and on the hardware platform used [14]; the effective compute performance varies widely across different network architectures even when they are executed on the same hardware. There have been several recent attempts to predict network latency and performance on different hardware platforms. We propose a framework for the generation of stacked mapping models and layer models to estimate network execution time. To our knowledge, this is the first work in which the different approaches to modeling layer execution time and mapping behavior are systematically investigated and evaluated on a broad range of network architectures. Our evaluation of the generated mapping and layer models on a set of 12 state-of-the-art models shows a mean absolute percentage error of 3.41% for the ZCU102. We compare mixed layer models with statistical layer models, the roofline model, and a refined roofline model in terms of accuracy and fidelity.
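The two evaluation metrics named above, mean absolute percentage error for accuracy and Spearman's rank correlation coefficient for fidelity, can be computed as in the following sketch; the latency values are made up for illustration only.

```python
import numpy as np
from scipy.stats import spearmanr

measured  = np.array([12.4, 30.1, 8.7, 55.2])   # ms, measured on the accelerator
estimated = np.array([12.0, 31.5, 9.1, 52.8])   # ms, model estimates

# Accuracy: mean absolute percentage error (MAPE).
mape = np.mean(np.abs((measured - estimated) / measured)) * 100

# Fidelity: does the estimator rank networks in the same order as the hardware?
rho, _ = spearmanr(measured, estimated)

print(f"MAPE: {mape:.2f}%  Spearman's rho: {rho:.3f}")
```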

RELATED WORK
BENCHMARK TOOL
MODEL GENERATOR
LAYER EXECUTION TIME MODELS
ESTIMATION
RESULTS AND PERFORMANCE
CONCLUSION
