Abstract

The two-stream network architecture can capture temporal and spatial features from videos simultaneously and has achieved excellent performance on video action recognition tasks. However, videos contain a fair amount of redundant information along both the temporal and spatial dimensions, which increases the complexity of network learning. To address this problem, we propose the residual spatial-temporal attention network (R-STAN), a feed-forward convolutional neural network that combines residual learning with a spatial-temporal attention mechanism for video action recognition, making the network focus on discriminative temporal and spatial features. In R-STAN, each stream is constructed by stacking residual spatial-temporal attention blocks (R-STABs). The spatial-temporal attention modules integrated into the residual blocks generate attention-aware features along the temporal and spatial dimensions, which largely reduces the redundant information. Together with residual learning, this allows us to construct a very deep network for learning spatial-temporal information in videos. As the layers go deeper, the attention-aware features produced by different R-STABs change adaptively. We validate R-STAN through extensive experiments on the UCF101 and HMDB51 datasets, which show that combining residual learning with the spatial-temporal attention mechanism contributes substantially to the performance of video action recognition.
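The block structure described above (a residual block whose residual branch is reweighted by a spatial-temporal attention module) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the class names (SpatialTemporalAttention, RSTABlock), the squeeze-and-excitation style channel attention, the 7×7 spatial attention convolution, and the exact placement of the attention module are choices made here for exposition, not the precise design reported in the paper.

```python
import torch
import torch.nn as nn


class SpatialTemporalAttention(nn.Module):
    """Illustrative attention module: channel-wise attention followed by a
    spatial attention mask, applied to one stream's 2D feature maps. For a
    temporal stream whose channels are stacked flow fields, the channel
    attention acts along time."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Channel attention: squeeze spatial dimensions, then re-weight channels.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single-channel mask over the H x W locations.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)   # emphasize informative channels
        x = x * self.spatial_gate(x)   # emphasize informative locations
        return x


class RSTABlock(nn.Module):
    """Residual block with the attention module on its residual branch, so the
    identity path is preserved while attention-aware features are added."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.attention = SpatialTemporalAttention(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.attention(self.body(x)))
```

A stream is then built by stacking several such blocks, operating on RGB frames in the spatial stream and on stacked optical-flow fields in the temporal stream.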

Highlights

  • Video-based human action recognition is important in many scientific and technological fields, such as intelligent monitoring, public security, human-computer interaction and behavioral analysis, and has gained wide attention from academia in recent years [1]–[7].

  • Our work comprehensively considers the performance and effectiveness of various action recognition networks and proposes a two-stream network that combines residual learning [11] with a spatial-temporal attention mechanism, which is able to extract and utilize vital spatial-temporal information from videos with long-term structure and achieves better performance.

  • The main contributions of this paper are as follows: (1) we propose a spatial-temporal attention module for video action recognition; (2) we propose the residual spatial-temporal attention network (R-STAN), a two-stream convolutional neural network (CNN) architecture that integrates the attention mechanism into a residual network (one stream of such an architecture is sketched below).
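Building on the block sketched after the abstract, a single stream of the network could be assembled as below. This is a sketch under stated assumptions: the stem layout, channel width, and number of blocks are illustrative guesses, and block_factory stands in for whatever residual attention block is used (for example, the hypothetical RSTABlock above); none of these reflect the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn
from typing import Callable


class RSTANStream(nn.Module):
    """One stream: stem convolution -> stacked residual attention blocks ->
    global average pooling -> linear classifier."""

    def __init__(self, in_channels: int, num_classes: int,
                 block_factory: Callable[[int], nn.Module],
                 channels: int = 64, num_blocks: int = 4):
        super().__init__()
        # Stem: reduce spatial resolution before the attention blocks.
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Stacked residual spatial-temporal attention blocks.
        self.blocks = nn.Sequential(*[block_factory(channels) for _ in range(num_blocks)])
        # Classification head.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, num_classes),
        )

    def forward(self, x):
        return self.head(self.blocks(self.stem(x)))
```

The spatial stream would take 3-channel RGB frames as input, while the temporal stream would take 2L channels of stacked horizontal and vertical optical-flow fields.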


Summary

Introduction

Video-based human action recognition is important in many scientific and technological fields, such as intelligent monitoring, public security, human-computer interaction and behavioral analysis, and has gained wide attention from academia in recent years [1]–[7]. The performance of an action recognition system depends to a large extent on whether it can extract and utilize the relevant information in a video. The emergence of convolutional neural networks (CNNs) has greatly promoted the advancement of image classification, image segmentation, object detection, and related tasks, and many researchers have built network structures with different depths and widths to extract complex features from images [11], [12]. Videos, however, consist of multiple frames, and 2D CNNs do not model their temporal and motion information, so networks that fuse the temporal information in videos are needed. There are three main ways to model temporal information.
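The approach this paper builds on is the two-stream design, in which an appearance (spatial) stream over RGB frames is paired with a motion (temporal) stream over stacked optical-flow fields and their class scores are fused. The sketch below illustrates this idea under stated assumptions: the backbones are arbitrary placeholder classifiers (for example, instances of the hypothetical RSTANStream above), the flow input is L stacked pairs of horizontal and vertical flow fields (2L channels), and weighted score-level averaging is just one simple fusion choice.

```python
import torch
import torch.nn as nn


class TwoStreamClassifier(nn.Module):
    """Illustrative two-stream setup: a spatial stream over RGB frames and a
    temporal stream over stacked optical flow, fused at the score level."""

    def __init__(self, spatial_backbone: nn.Module, temporal_backbone: nn.Module,
                 fusion_weight: float = 0.5):
        super().__init__()
        self.spatial = spatial_backbone    # classifier over a 3-channel RGB frame
        self.temporal = temporal_backbone  # classifier over 2L stacked flow channels
        self.w = fusion_weight             # weight given to the spatial scores

    def forward(self, rgb, flow):
        # rgb:  (N, 3, H, W)   sampled RGB frame(s)
        # flow: (N, 2L, H, W)  L stacked horizontal/vertical flow fields
        spatial_scores = self.spatial(rgb)
        temporal_scores = self.temporal(flow)
        return self.w * spatial_scores + (1.0 - self.w) * temporal_scores
```

In the classic two-stream formulation the streams are typically trained separately and their softmax scores averaged over several sampled frames or flow stacks; the single fusion weight here is only the simplest version of that idea.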

