Abstract

Action recognition is an important yet challenging task in computer vision. The attention mechanism tells not only where but also when to focus, and it plays a key role in extracting discriminative spatial and temporal features for this task. In this paper, we propose an improved spatiotemporal attention model based on the two-stream structure to recognize actions in videos. Specifically, we first extract intra-frame spatial features and inter-frame optical flow features for each video. We then apply an effective attention module that sequentially infers attention maps along three separate dimensions: channel, spatial, and temporal. After adaptively refining the features with these attention maps, we perform temporal pooling to squeeze the temporal dimension. The refined spatial and temporal features are then fed into a spatial LSTM and a temporal LSTM, respectively. Finally, we fuse the spatial feature, the temporal feature, and the two-stream fusion feature to classify the actions in videos. Additionally, we collect and construct a new Ping-Pong action dataset from YouTube for a subsequent human-robot interaction task; it contains 2400 labeled videos across 4 categories. We compare the proposed method with other action recognition algorithms and validate its feasibility and effectiveness on the Ping-Pong action dataset and the HMDB51 dataset.
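
To make the three-dimensional attention concrete, below is a minimal PyTorch sketch of an attention module that infers channel, spatial, and temporal maps in sequence, in the spirit of CBAM, which the name iCBAM suggests this work extends. The class name, parameters (reduction, frames), and pooling choices are our assumptions for illustration, not the authors' code.

```python
# Minimal sketch of a channel -> spatial -> temporal attention module,
# following the CBAM formulation; names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class iCBAM(nn.Module):
    def __init__(self, channels: int, frames: int, reduction: int = 16):
        super().__init__()
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over stacked avg/max channel maps.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Temporal attention: small MLP over per-frame descriptors.
        self.temporal_mlp = nn.Sequential(
            nn.Linear(frames, frames // 2),
            nn.ReLU(inplace=True),
            nn.Linear(frames // 2, frames),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) per-frame feature maps; T must equal `frames`.
        b, t, c, h, w = x.shape
        x = x.view(b * t, c, h, w)

        # 1) Channel attention map, shape (B*T, C, 1, 1).
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b * t, c, 1, 1)

        # 2) Spatial attention map, shape (B*T, 1, H, W).
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_conv(s))

        # 3) Temporal attention map, shape (B, T), broadcast over C, H, W.
        x = x.view(b, t, c, h, w)
        f = x.mean(dim=(2, 3, 4))            # per-frame descriptor, (B, T)
        w_t = torch.sigmoid(self.temporal_mlp(f))
        return x * w_t.view(b, t, 1, 1, 1)
```

Applied to a clip of per-frame CNN features, the module rescales the features along each of the three dimensions in turn before the temporal pooling step described above.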

Highlights

  • Video action recognition aims to predict action type labels from videos; it has drawn increasing attention given its potential applications in many fields, e.g., assisted living, human-robot interaction, and intelligent video surveillance

  • The main contributions of this work are summarized as follows: 1) We propose a deep learning framework for video action recognition that can explicitly capture spatiotemporal information with a three-dimensional attention module

  • MODEL ARCHITECTURE: this paper proposes an iCBAM-based spatiotemporal-stream model for action recognition in videos

Summary

INTRODUCTION

Video action recognition aims to predict action type labels from videos, and it has drawn increasing attention given its potential applications in many fields, e.g., assisted living, human-robot interaction, and intelligent video surveillance. Zang et al. [8] suggest utilizing an attention-based temporal weighted CNN to learn action features. All these methods try to find an effective model that can identify discriminative spatiotemporal information in video data. Motivated by the above, we propose a three-dimensional attention-based spatiotemporal-stream model for the action recognition task. The aim of this model is to selectively consider information over the spatial, channel, and temporal dimensions simultaneously. The main contributions of this work are summarized as follows: 1) We propose a deep learning framework for video action recognition that can explicitly capture spatiotemporal information with a three-dimensional attention module. We introduce each module sequentially following the proposed network architecture (Fig. 1): the Network Inputs, the Improved Attention Mechanism (iCBAM), the Temporal Segment, and the Temporal and Spatial LSTMs; a sketch of how these pieces could fit together follows below. We discuss the experimental details and results of the proposed method for action recognition in the EXPERIMENTS section.
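
As a rough sketch of how the refined features could flow through the remaining modules, the snippet below pools each temporal segment over time, runs the RGB and optical-flow streams through a spatial and a temporal LSTM, and averages the scores of the spatial, temporal, and two-stream fusion branches. All layer sizes, names, and the score-averaging fusion are assumptions for illustration, not the authors' implementation.

```python
# Illustrative two-stream head: per-stream LSTMs plus a fused branch.
import torch
import torch.nn as nn


class TwoStreamHead(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256,
                 num_classes: int = 4):
        super().__init__()
        self.spatial_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.temporal_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc_spatial = nn.Linear(hidden, num_classes)
        self.fc_temporal = nn.Linear(hidden, num_classes)
        self.fc_fusion = nn.Linear(2 * hidden, num_classes)

    def forward(self, rgb_feats: torch.Tensor,
                flow_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats / flow_feats: (B, segments, feat_dim), i.e. attention-
        # refined features after per-segment temporal (average) pooling.
        _, (h_s, _) = self.spatial_lstm(rgb_feats)
        _, (h_t, _) = self.temporal_lstm(flow_feats)
        h_s, h_t = h_s[-1], h_t[-1]            # last-layer hidden states
        fused = torch.cat([h_s, h_t], dim=1)   # two-stream fusion feature
        # Combine the three prediction branches by averaging their scores.
        return (self.fc_spatial(h_s) + self.fc_temporal(h_t)
                + self.fc_fusion(fused)) / 3.0
```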

EXPERIMENTS
Findings
CONCLUSION
