Abstract
The two-stream network architecture can capture temporal and spatial features from videos simultaneously and has achieved excellent performance on video action recognition tasks. However, videos contain a fair amount of redundant information in both the temporal and spatial dimensions, which increases the difficulty of network learning. To address this problem, we propose the residual spatial-temporal attention network (R-STAN), a feed-forward convolutional neural network that uses residual learning and a spatial-temporal attention mechanism for video action recognition, making the network focus on discriminative temporal and spatial features. In R-STAN, each stream is constructed by stacking residual spatial-temporal attention blocks (R-STABs); the spatial-temporal attention modules integrated into the residual blocks generate attention-aware features along the temporal and spatial dimensions, which largely reduces the redundant information. Together with the properties of residual learning, this allows us to construct a very deep network for learning spatial-temporal information in videos. As the layers go deeper, the attention-aware features produced by different R-STABs adapt accordingly. We validate R-STAN through extensive experiments on the UCF101 and HMDB51 datasets, which show that combining residual learning with the spatial-temporal attention mechanism contributes substantially to video action recognition performance.
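For concreteness, the following is a minimal sketch of how a residual block with an integrated spatial-temporal attention module might look, assuming a PyTorch implementation; the module names, layer choices, and reduction ratio are illustrative assumptions and not the authors' released code.

```python
# Hypothetical sketch of a residual spatial-temporal attention block (R-STAB).
# Shapes, layer choices, and the reduction ratio are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn


class SpatialTemporalAttention(nn.Module):
    """Produces temporal and spatial attention maps for a (N, C, T, H, W) tensor."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Temporal attention: squeeze each frame to a descriptor, then weight frames.
        self.temporal = nn.Sequential(
            nn.AdaptiveAvgPool3d((None, 1, 1)),           # (N, C, T, 1, 1)
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, 1, kernel_size=1),
            nn.Sigmoid(),                                  # (N, 1, T, 1, 1)
        )
        # Spatial attention: channel-pooled maps passed through a small convolution.
        self.spatial_conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t_att = self.temporal(x)                           # weight per frame
        avg_map = x.mean(dim=1, keepdim=True)              # (N, 1, T, H, W)
        max_map = x.amax(dim=1, keepdim=True)              # (N, 1, T, H, W)
        s_att = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * t_att * s_att                           # attention-aware features


class RSTAB(nn.Module):
    """Residual block with the attention module applied to the residual branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(channels),
        )
        self.attention = SpatialTemporalAttention(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.attention(self.body(x))
        return self.relu(out + x)                          # identity shortcut keeps gradients flowing


if __name__ == "__main__":
    block = RSTAB(channels=64)
    clip = torch.randn(2, 64, 8, 56, 56)                   # (batch, channels, frames, H, W)
    print(block(clip).shape)                               # torch.Size([2, 64, 8, 56, 56])
```

Because the attention output is added back to the identity shortcut, the block degrades gracefully when the attention maps are uninformative, which is what makes stacking many such blocks into a very deep network practical.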
Highlights
Video-based human action recognition is important in many scientific and technological fields, such as intelligent monitoring, public security, human-computer interaction and behavioral analysis, and has gained wide attention from academia in recent years [1]–[7]
Our work comprehensively considers the performance and effectiveness of various action recognition networks and proposes a two-stream network that combines residual learning [11] with a spatial-temporal attention mechanism, which is able to extract and exploit vital spatial-temporal information from the long-term structure of videos and achieves better performance
The main contributions of this paper are as follows: (1) we propose a spatial-temporal attention module for video action recognition; (2) we propose the residual spatial-temporal attention network (R-STAN), a two-stream convolutional neural network (CNN) architecture that integrates the attention mechanism into a residual network
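As an illustration of the two-stream composition named in contribution (2), the sketch below assembles a spatial (RGB) stream and a temporal (stacked optical flow) stream and fuses their class scores by averaging. The stand-in backbones, the choice of 10 stacked flow fields, and the score-averaging fusion are assumptions based on standard two-stream practice rather than details stated in this summary; in R-STAN itself each stream would be built by stacking R-STABs.

```python
# Hypothetical two-stream assembly and late-fusion sketch; backbones, the number
# of stacked flow fields (10), and score averaging are standard two-stream
# assumptions, not details taken from the paper.
import torch
import torch.nn as nn


def build_stream(in_channels: int, num_classes: int) -> nn.Module:
    """A stand-in backbone; in R-STAN each stream would stack R-STABs instead."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, num_classes),
    )


num_classes = 101                                             # e.g. UCF101
spatial_stream = build_stream(in_channels=3, num_classes=num_classes)        # RGB frame
temporal_stream = build_stream(in_channels=2 * 10, num_classes=num_classes)  # 10 stacked (x, y) flow fields

rgb = torch.randn(4, 3, 224, 224)                             # one sampled frame per clip
flow = torch.randn(4, 20, 224, 224)                           # stacked optical flow for the same clips

# Late fusion: average the per-stream class probabilities to get the final prediction.
scores = (spatial_stream(rgb).softmax(dim=1) + temporal_stream(flow).softmax(dim=1)) / 2
print(scores.argmax(dim=1))
```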
Summary
Video-based human action recognition is important in many scientific and technological fields, such as intelligent monitoring, public security, human-computer interaction and behavioral analysis, and has gained wide attention from academia in recent years [1]–[7]. The performance of an action recognition system depends to a large extent on whether it can extract and exploit relevant information from the video. The emergence of convolutional neural networks (CNNs) has greatly advanced image classification, image segmentation, object detection and related tasks, and many researchers have built network structures with different depths and widths to extract complex features from images [11], [12]. Video, however, consists of multiple frames, and 2D CNNs do not model its time and motion information, so networks that fuse the temporal information in videos are needed. There are three ways to model time information.