Abstract

In recent years, deep neural networks have become the state of the art for a variety of tasks such as image classification, object detection and localization, and speech recognition. Due to their significant computational load, deep neural networks are traditionally trained on high-power CPU or GPU systems, while inference can be performed on less powerful hardware. However, deploying trained deep neural networks on low-power embedded hardware remains challenging: state-of-the-art deep neural networks have a large memory footprint and require many floating-point operations, and efficient compression schemes, though effective, require custom datatypes and compilers that limit their applicability. In this article, we present a benchmark comparison of common deep learning architectures and a novel architecture for deployment on different embedded platforms, showing that further gains in accuracy and speed-up can be achieved with an optimized network design. We also propose a generalized fixed-point conversion scheme that requires neither custom datatypes nor compilers. Both are of interest for engineering applications, where computational power and expert knowledge about efficient neural network implementations are typically limited.
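The abstract only names the fixed-point conversion scheme, so as a rough illustration of the general idea, here is a minimal sketch in C of fixed-point conversion using only standard integer types. The Q3.12 format, the names `to_fixed` and `fixed_mac`, and the rounding and saturation choices are illustrative assumptions, not the paper's actual scheme.

```c
/*
 * Minimal sketch of fixed-point conversion using only standard C
 * integer types (no custom datatypes or compilers). The Q-format,
 * names, and rounding strategy are illustrative assumptions, not
 * the scheme described in the article.
 */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define FRAC_BITS 12  /* Q3.12: 1 sign bit, 3 integer bits, 12 fractional bits */

/* Quantize a float to 16-bit fixed point with round-to-nearest and saturation. */
static int16_t to_fixed(float x) {
    float scaled = roundf(x * (float)(1 << FRAC_BITS));
    if (scaled > (float)INT16_MAX) scaled = (float)INT16_MAX;
    if (scaled < (float)INT16_MIN) scaled = (float)INT16_MIN;
    return (int16_t)scaled;
}

/* Convert back to float (for verification only; inference stays integer). */
static float to_float(int16_t q) {
    return (float)q / (float)(1 << FRAC_BITS);
}

/* Fixed-point multiply-accumulate: products are held in a 32-bit
 * accumulator and rescaled once with an arithmetic shift, so the
 * inner loop needs no floating-point operations. */
static int16_t fixed_mac(const int16_t *w, const int16_t *x, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)w[i] * (int32_t)x[i];
    return (int16_t)(acc >> FRAC_BITS);  /* arithmetic shift on typical targets */
}

int main(void) {
    float weights_f[3] = {0.5f, -1.25f, 0.75f};
    float inputs_f[3]  = {1.0f,  0.5f, -2.0f};
    int16_t w[3], x[3];
    for (int i = 0; i < 3; ++i) {
        w[i] = to_fixed(weights_f[i]);
        x[i] = to_fixed(inputs_f[i]);
    }
    /* float reference: 0.5*1.0 + (-1.25)*0.5 + 0.75*(-2.0) = -1.625 */
    printf("fixed-point dot product: %f\n", to_float(fixed_mac(w, x, 3)));
    return 0;
}
```

Keeping the products in a 32-bit accumulator and rescaling with a single shift is what keeps the inner loop entirely integer-based, which is the property the abstract attributes to the proposed scheme.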
