Abstract

To address the unbalanced distribution of spatio-temporal information in video, this paper proposes a 2D/3D hybrid convolutional network with an attention mechanism. The network captures both the spatial information and the dynamic motion information in a video and better reveals motion features. Following the two-stream convolutional network structure, we build a 2D and a 3D convolutional neural network in parallel. First, the 2D convolutional network uses a residual structure and an LSTM model to focus on the spatial features of the video behavior. Second, a 3D convolutional network built from Inception structures extracts the spatio-temporal features of the video behavior. The attention mechanism then fuses the two sets of high-level semantic features. Finally, the resulting salient feature vector is used for video behavior recognition. In comparisons with other network models on the UCF101 and HMDB51 datasets, the proposed 2D/3D hybrid convolutional network shows good recognition performance and robustness.
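The abstract does not specify the form of the attention used in the fusion step, so the following is only a minimal sketch under stated assumptions: each stream is taken to yield one feature vector of the same length, the streams are scored against a query vector with a scaled dot product, and the scores are normalised with a softmax. The names `attention_fuse`, `spatial_feat`, `spatiotemporal_feat`, and `query` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(spatial_feat, spatiotemporal_feat, query):
    """Fuse two equal-length feature vectors with attention weights.

    Assumption: scaled dot-product scoring against `query`; the paper's
    actual attention formulation may differ.
    """
    streams = [spatial_feat, spatiotemporal_feat]
    dim = len(query)
    if any(len(feat) != dim for feat in streams):
        raise ValueError("all vectors must have the same length")

    # One relevance score per stream.
    scores = [
        sum(q * x for q, x in zip(query, feat)) / math.sqrt(dim)
        for feat in streams
    ]
    weights = softmax(scores)

    # Weighted sum of the two streams, element by element.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, streams))
        for i in range(dim)
    ]
    return fused, weights
```

In a trained network the query vector would be a learned parameter, and the fused vector would be passed to a classifier for behavior recognition. Here it is supplied by hand purely for illustration.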
