Abstract

Video object detection and action recognition typically require deep neural networks (DNNs) with a huge number of parameters, making it challenging to deploy a DNN video comprehension unit on resource-constrained terminal devices. In this article, we introduce a deeply tensor-compressed video comprehension neural network, called DEEPEYE, for inference on terminal devices. Instead of building a Long Short-Term Memory (LSTM) network directly from high-dimensional raw video input, we construct an LSTM-based spatio-temporal model from structured, tensorized time-series features for object detection and action recognition. Deep compression is achieved by tensor decomposition and trained quantization of the time-series-feature-based LSTM network. We have implemented DEEPEYE on an ARM-core-based IoT board, achieving 31 FPS while consuming only 2.4 W of power. On the video benchmarks MOMENTS, UCF11 and HMDB51, DEEPEYE achieves a 228.1× model compression with only a 0.47% mAP reduction, as well as a 15,000× parameter reduction with up to 8.01% accuracy improvement over competing approaches.
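
The sketch below is not the authors' implementation; it is a minimal NumPy illustration of the kind of tensor compression the abstract describes, in which a dense LSTM gate matrix is replaced by small tensor-train (TT) cores. The mode sizes, ranks, and function names are illustrative assumptions chosen only to show the parameter savings.

```python
import numpy as np

def tt_cores(in_modes, out_modes, ranks, rng):
    """Random TT-matrix cores; core k has shape (r_{k-1}, m_k, n_k, r_k)."""
    return [0.1 * rng.standard_normal((ranks[k], m, n, ranks[k + 1]))
            for k, (m, n) in enumerate(zip(in_modes, out_modes))]

def tt_to_dense(cores):
    """Contract TT cores back into the equivalent dense matrix (sanity check only)."""
    full = cores[0][0]                      # squeeze leading rank of 1 -> (m1, n1, r1)
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full[..., 0]                     # squeeze trailing rank of 1
    d = len(cores)
    # axes are (m1, n1, m2, n2, ...); group all input modes, then all output modes
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    full = np.transpose(full, perm)
    m = int(np.prod([c.shape[1] for c in cores]))
    n = int(np.prod([c.shape[2] for c in cores]))
    return full.reshape(m, n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 4096 x 4096 LSTM gate matrix, factorized as 4096 = 8*8*8*8
    in_modes, out_modes, ranks = (8, 8, 8, 8), (8, 8, 8, 8), (1, 4, 4, 4, 1)
    cores = tt_cores(in_modes, out_modes, ranks, rng)
    W = tt_to_dense(cores)
    dense_params = W.size
    tt_params = sum(c.size for c in cores)
    print(W.shape, "dense params:", dense_params, "TT params:", tt_params,
          "compression: %.0fx" % (dense_params / tt_params))
```

With these assumed ranks the TT cores hold a few thousand parameters versus roughly 16.8 million for the dense matrix; in DEEPEYE this style of decomposition is combined with trained quantization to reach the reported compression ratios.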
