Abstract
Human action recognition is an important research topic in computer vision because of its wide range of applications. Recently, a variety of approaches based on deep learning features have been proposed owing to the effectiveness of deep neural networks. However, most of these approaches cannot fully extract spatiotemporal features from videos because they ignore the diversity of scales in the temporal domain. In this paper, we propose a two-stream convolutional network with long-short-term spatiotemporal features (LSF CNN) for the human action recognition task. The network is mainly composed of two subnetworks. One is a long-term spatiotemporal feature extraction network (LT-Net) that takes stacked RGB images as input. The other is a short-term spatiotemporal feature extraction network (ST-Net) that takes as input the optical flow estimated from two adjacent frames. The features at the two temporal scales are fused in the fully-connected layer and fed into a linear support vector machine (SVM). We also propose a new expression for the optical flow field, which is shown to outperform the traditional expression on the action recognition problem. With the two-stream architecture, the network can fully learn deep features in both the spatial and temporal domains. Experimental results on the HMDB51 and UCF101 datasets indicate that the proposed approach improves action recognition accuracy by exploiting long-short-term spatiotemporal information.
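To make the two-stream design concrete, the following minimal PyTorch sketch shows how such a network could be assembled: a 3D-convolutional long-term stream over a stack of RGB frames, a 2D-convolutional short-term stream over a two-channel optical flow field, and fusion of the two fully-connected feature vectors. The class names, layer counts, and channel sizes are illustrative assumptions, not the exact architecture reported in the paper.

```python
# Hypothetical sketch of a two-stream long-short-term feature network.
# Layer sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn

class LTNet(nn.Module):
    """Long-term stream: 3D convolutions over a stack of RGB frames."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):              # x: (batch, 3, num_frames, H, W)
        return self.fc(self.features(x).flatten(1))

class STNet(nn.Module):
    """Short-term stream: 2D convolutions over a two-channel optical flow field."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):              # x: (batch, 2, H, W)
        return self.fc(self.features(x).flatten(1))

class LSFCNN(nn.Module):
    """Fuses long- and short-term features at the fully-connected level."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.lt_net = LTNet(feat_dim)
        self.st_net = STNet(feat_dim)

    def forward(self, rgb_stack, flow):
        # Concatenated features would then be passed to a linear SVM.
        return torch.cat([self.lt_net(rgb_stack), self.st_net(flow)], dim=1)

# Example: one clip of 16 RGB frames plus one optical flow field.
model = LSFCNN()
rgb = torch.randn(1, 3, 16, 112, 112)
flow = torch.randn(1, 2, 112, 112)
print(model(rgb, flow).shape)          # torch.Size([1, 512])
```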
Highlights
Human action recognition, which classifies human behaviors in videos, is one of the most active research areas in computer vision
We propose a novel action recognition approach based on a long-short-term spatiotemporal feature extraction convolutional network (LSF CNN), which combines a 3D convolutional network and a 2D convolutional network
We report the average accuracy over the splits on both HMDB51 and UCF101
Summary
Human action recognition, which classifies human behaviors in videos, is one of the most active research areas in computer vision. Some action recognition approaches take motion information into account, but they usually take a single-scale optical flow image or a single-scale image sequence as the input sample. These approaches do not take the diversity of behavior scales in the temporal domain into account. We propose a novel action recognition approach with a long-short-term spatiotemporal feature extraction convolutional network (LSF CNN), which is based on a 3D convolutional network and a 2D convolutional network. The proposed LSF CNN is a novel framework that uses a two-stream structure to build a convolutional network from video data to action classification results. It can also be used as a common feature extractor for other visual recognition tasks, for example by feeding the fused features into a separate classifier, as sketched below.
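The classification stage described above (fused features followed by a linear SVM) could look like the following sketch using scikit-learn. The extract_features() helper, the 512-dimensional feature size, and the random placeholder data are assumptions standing in for the LSF CNN extractor and real video clips.

```python
# Hypothetical sketch: training a linear SVM on fused spatiotemporal features.
import numpy as np
from sklearn.svm import LinearSVC

def extract_features(clips):
    """Placeholder for the LSF CNN feature extractor (assumed, not the paper's code)."""
    return np.random.randn(len(clips), 512)

# Placeholder training data: 100 clips with labels from 51 action classes.
train_clips = list(range(100))
train_labels = np.random.randint(0, 51, size=100)

clf = LinearSVC(C=1.0)
clf.fit(extract_features(train_clips), train_labels)

# Predict action classes for new clips from their fused features.
test_clips = list(range(20))
predictions = clf.predict(extract_features(test_clips))
```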