Abstract

Video data have two distinct intrinsic modes, in-frame and temporal. For video applications it is beneficial to build dynamic features on top of static in-frame features. However, some existing methods, such as recurrent neural networks, do not perform well, while others, such as 3D convolutional neural networks (CNNs), are both memory- and time-consuming. This study proposes an effective framework that takes advantage of deep learning for static image feature extraction to tackle video data. After extracting in-frame feature vectors with a pretrained deep network, the authors integrate them into a multi-mode feature matrix, which preserves the multi-mode structure and the high-level representation. They propose two models for the follow-up classification. The authors first introduce a temporal CNN, which feeds the multi-mode feature matrix directly into a CNN. However, they show that the characteristics of the multi-mode features differ significantly across the distinct modes. The authors therefore further propose the multi-mode neural network (MMNN), in which different modes are handled by different types of layers. They evaluate their algorithm on the task of human action recognition. The experimental results show that the MMNN achieves much better performance than existing long short-term memory-based methods and consumes far fewer resources than existing 3D end-to-end models.
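To make the pipeline described above concrete, the following is a minimal illustrative sketch, not the authors' actual implementation: it assumes a torchvision ResNet-18 as the pretrained in-frame extractor and a simple 1D temporal CNN as the follow-up classifier; all layer sizes, class names, and the number of frames are hypothetical stand-ins for the unspecified details of the temporal CNN and MMNN.

```python
# Sketch: per-frame features from a pretrained CNN, stacked into a
# (temporal x feature) multi-mode matrix, then classified by a 1D temporal CNN.
# The extractor backbone (ResNet-18) and all layer sizes are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class FrameFeatureExtractor(nn.Module):
    """Static in-frame feature vectors from a frozen pretrained CNN."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the final FC head; keep convolutional trunk + global pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.features.parameters():
            p.requires_grad = False  # extractor stays fixed

    def forward(self, frames):            # frames: (T, 3, 224, 224)
        with torch.no_grad():
            f = self.features(frames)     # (T, 512, 1, 1)
        return f.flatten(1)               # multi-mode feature matrix: (T, 512)

class TemporalCNN(nn.Module):
    """Feed the multi-mode feature matrix into 1D convolutions over time."""
    def __init__(self, feat_dim=512, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),      # pool over the temporal mode
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, feat_matrix):       # (batch, T, feat_dim)
        x = feat_matrix.transpose(1, 2)   # (batch, feat_dim, T) for Conv1d
        return self.fc(self.conv(x).squeeze(-1))

# Usage on one hypothetical 16-frame clip:
extractor, classifier = FrameFeatureExtractor(), TemporalCNN()
clip = torch.randn(16, 3, 224, 224)
logits = classifier(extractor(clip).unsqueeze(0))  # (1, num_classes)
```

The MMNN variant described in the abstract goes further by applying different layer types to the temporal and feature modes rather than a single convolution over the stacked matrix; the sketch above only illustrates the shared two-stage structure (pretrained in-frame extraction followed by a lightweight temporal classifier).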
