Abstract

Temporal modeling is key to action recognition in videos, but traditional 2D CNNs do not capture temporal relationships well. 3D CNNs achieve good performance but are computationally intensive and difficult to deploy on existing devices. To address these problems, we design a generic and effective module called the spatio-temporal motion network (SMNet). SMNet retains the complexity of 2D CNNs, reducing the computational cost of the algorithm while achieving performance comparable to 3D CNNs. SMNet contains a spatio-temporal excitation (SE) module and a motion excitation (ME) module. The SE module uses group convolution to fuse temporal information, reducing the number of network parameters, and uses spatial attention to extract spatial information. The ME module uses differences between adjacent frames to extract feature-level motion patterns, which effectively encode motion features and help identify actions efficiently. We use ResNet-50 as the backbone network and insert SMNet into its residual blocks to form a simple and effective action recognition network. Experimental results on three datasets, namely Something-Something V1, Something-Something V2, and Kinetics-400, show that it outperforms state-of-the-art action recognition networks.

Highlights

  • Action recognition is central to video understanding, which aims to enable computers to accurately understand video content and classify videos

  • We propose a spatio-temporal motion network that combines spatio-temporal information with motion information; it can be integrated into a ResNet network and recognizes actions accurately and efficiently

  • Our model is compared with existing action recognition methods on the Something-Something V1, Something-Something V2, and Kinetics-400 datasets

Summary

Introduction

Action recognition is central to video understanding, which aims to enable computers to accurately understand video content and classify videos. Video action recognition methods map the motion information and spatial information of the raw video data into a feature space to obtain a feature representation of the video, and then accurately classify the action in the video according to this feature descriptor. How to extract action information that accurately represents the video content is the key problem in video action recognition. 2D-based methods mostly extract motion features through optical flow, which incurs additional cost. 3D-based methods achieve higher performance than 2D ones, but are computationally intensive and poorly suited to practical applications. There is thus a need for an action recognition method that is operational in practical environments.
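The frame-difference idea behind the ME module, that subtracting adjacent feature maps yields a feature-level motion signal without computing optical flow, can be sketched as follows. This is a minimal NumPy illustration: the tensor shapes, the zero-padding of the last frame, and the sigmoid channel gating are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def motion_excitation(features):
    """Illustrative sketch of motion excitation: differences between
    adjacent frames approximate motion at the feature level, and are
    used to re-weight channels, attention-style.

    features: array of shape (T, C, H, W) -- T frames of C feature maps.
    Returns a gated tensor of the same shape.
    """
    # Feature-level "motion": difference between consecutive frames.
    diffs = features[1:] - features[:-1]            # (T-1, C, H, W)
    # Pad the last frame with zeros to keep temporal length T
    # (a common convention; an assumption here, not from the paper).
    pad = np.zeros_like(features[:1])               # (1, C, H, W)
    motion = np.concatenate([diffs, pad], axis=0)   # (T, C, H, W)
    # Pool spatially, then gate each channel with a sigmoid weight.
    weights = 1.0 / (1.0 + np.exp(-motion.mean(axis=(2, 3))))  # (T, C)
    return features * weights[:, :, None, None]

# Toy input: 4 frames, 8 channels, 6x6 feature maps.
x = np.random.rand(4, 8, 6, 6).astype(np.float32)
y = motion_excitation(x)
```

Because the gating weights lie in (0, 1), the output is a softly suppressed copy of the input in which channels carrying little frame-to-frame change are attenuated; the real module would learn the channel-reduction and gating convolutions end to end.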

