Abstract
With new accelerator hardware for deep neural networks (DNNs), the computing power for AI applications has increased rapidly. However, as DNN algorithms become more complex and optimized for specific applications, latency requirements remain challenging, and it is critical to find the optimal points in the design space. To decouple the architectural search from the target hardware, we propose a time estimation framework that allows for modeling the inference latency of DNNs on hardware accelerators based on mapping and layer-wise estimation models. The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation. For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models with the roofline model and a refined roofline model. We test the mixed models on the ZCU102 SoC board with DNNDK and on the Intel Neural Compute Stick 2 (NCS2) on a set of 12 state-of-the-art neural networks. The mixed model shows an average estimation error of 3.47% for the DNNDK and 7.44% for the NCS2, outperforming the statistical and analytical layer models for almost all selected networks. For a randomly selected subset of 34 networks from the NASBench dataset, the mixed model reaches a fidelity of 0.988 measured by Spearman's rank correlation coefficient. The code of ANNETTE is publicly available at https://github.com/embedded-machine-learning/annette.
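To illustrate the analytical baseline against which the learned models are compared, the sketch below estimates per-layer latency with a plain roofline model: each layer is bounded either by peak compute throughput or by memory bandwidth. The peak-performance and bandwidth numbers, the cost formula, and the helper names are illustrative assumptions, not values or code from the paper.

```python
# Minimal roofline-style latency sketch (illustrative assumptions only;
# peak_ops and mem_bw are placeholder hardware parameters, not measured values).

def roofline_latency(ops, mem_bytes, peak_ops=1.2e12, mem_bw=19.2e9):
    """Latency lower bound: the layer is either compute- or memory-bound."""
    compute_time = ops / peak_ops          # seconds if compute-bound
    memory_time = mem_bytes / mem_bw       # seconds if memory-bound
    return max(compute_time, memory_time)

def conv2d_costs(h, w, c_in, c_out, k, batch=1):
    """Approximate operation count and byte traffic of a stride-1 convolution."""
    ops = 2 * batch * h * w * c_in * c_out * k * k
    mem_bytes = 4 * batch * h * w * (c_in + c_out) + 4 * k * k * c_in * c_out
    return ops, mem_bytes

# Example: one 3x3 convolution on a 56x56x64 feature map with 64 output channels.
ops, mem = conv2d_costs(56, 56, 64, 64, 3)
print(f"roofline estimate: {roofline_latency(ops, mem) * 1e3:.3f} ms")
```

A refined roofline variant would additionally account for mapping inefficiencies (for example, partially filled compute tiles), which is one reason the paper's mixed models outperform the purely analytical baselines.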
Highlights
Deep Neural Networks (DNNs) have become key components in many Artificial Intelligence (AI) applications, including autonomous driving [1], medical diagnosis [2], [3], and machine translation [4].
Attempting to close the gap between the computational intensity of DNNs and the available computing power, a wide variety of hardware accelerators for DNNs and other AI workloads have emerged in recent years.
Experimental setup: all experiments were performed with batch size 1 to achieve the lowest possible latency. By adding the batch size as an additional parameter of the benchmark dataset and to the input feature vector of the estimation models, the method could be extended to larger batch sizes (see the sketch below).
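As a rough illustration of that extension, the following sketch adds batch size to a per-layer feature vector and fits a simple statistical estimator on benchmark measurements. The feature layout, the regressor choice, and the latency values are hypothetical and not taken from the ANNETTE benchmarks.

```python
# Hypothetical sketch: extending a statistical layer-latency estimator with batch size.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Per-layer feature vector: [batch, height, width, c_in, c_out, kernel]
X_bench = np.array([
    [1, 56, 56, 64, 64, 3],
    [4, 56, 56, 64, 64, 3],
    [1, 28, 28, 128, 128, 3],
    [4, 28, 28, 128, 128, 3],
])
y_latency_ms = np.array([0.42, 1.51, 0.38, 1.33])  # made-up measured latencies

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_bench, y_latency_ms)

# Predict the latency of an unseen layer configuration at batch size 2.
print(model.predict([[2, 56, 56, 64, 64, 3]]))
```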
Summary
Deep Neural Networks have become key components in many AI applications, including autonomous driving [1], medical diagnosis [2], [3], and machine translation [4]. Computational efficiency depends largely on the specific architectural parameters of each layer and on the hardware platform used [14]. The effective compute performance varies widely across different network architectures, even when they are executed on the same hardware. There have been some recent attempts to predict network latency and performance on different hardware platforms. We propose a framework for the generation of stacked mapping models and layer models to estimate the network execution time. To our knowledge, this is the first work in which the different approaches to modeling layer execution time and mapping models are systematically investigated and evaluated on a broad range of network architectures. Our evaluation of the generated mapping models and layer models on a set of 12 state-of-the-art models shows a mean absolute percentage error of 3.41% for the ZCU102. We compare mixed layer models with statistical layer models, the roofline model, and a refined roofline model in terms of accuracy and fidelity.
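The two evaluation criteria used throughout, accuracy as mean absolute percentage error and fidelity as Spearman's rank correlation between estimated and measured latencies, can be computed as in the sketch below; the latency arrays are placeholder values, not results from the paper.

```python
# Accuracy (MAPE) and fidelity (Spearman's rank correlation) for latency estimates.
# The measured/estimated values below are placeholders, not results from the paper.
import numpy as np
from scipy.stats import spearmanr

measured_ms = np.array([12.1, 8.4, 25.7, 4.9, 17.3])
estimated_ms = np.array([11.6, 8.9, 24.8, 5.2, 18.1])

mape = np.mean(np.abs(estimated_ms - measured_ms) / measured_ms) * 100
fidelity, _ = spearmanr(estimated_ms, measured_ms)

print(f"MAPE: {mape:.2f}%  Spearman fidelity: {fidelity:.3f}")
```

High fidelity matters for neural architecture search: even if absolute estimates are off, a model that ranks candidate networks correctly still leads the search to the fastest architectures.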