Learning Attention-Enhanced Spatiotemporal Representation for Action Recognition

Zhensheng Shi, Liangjie Cao, Haiyong Zheng, Zhibin Yu, Cheng Guan, Bing Zheng, Zhaorui Gu

doi:10.1109/access.2020.2968024

Abstract

Learning spatiotemporal features with 3D-CNN (3D Convolutional Neural Network) models is regarded as an effective approach to action recognition. In this paper, we explore the visual attention mechanism for video analysis and propose a novel 3D-CNN model, dubbed AE-I3D (Attention-Enhanced Inflated-3D Network), for learning attention-enhanced spatiotemporal representations. The contribution of AE-I3D is threefold. First, we inflate soft attention to the spatiotemporal scope of 3D videos and adopt softmax to generate a probability distribution over attentional features in a feedforward 3D-CNN architecture. Second, we devise an AE-Res (Attention-Enhanced Residual learning) module, which learns attention-enhanced features through two-branch residual learning; the module is lightweight and flexible, so it can be easily embedded into many 3D-CNN architectures. Finally, we embed multiple AE-Res modules into an I3D (Inflated-3D) network, yielding our AE-I3D model, which can be trained in an end-to-end, video-level manner. Unlike previous attention networks, our method inflates residual attention from 2D images to 3D videos, performing 3D attention residual learning to enhance the spatiotemporal representation. We use RGB-only video data for evaluation on three benchmarks: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that AE-I3D is effective and achieves competitive performance.

Highlights

Action recognition is a challenging task in video understanding research
Inspired by previous work on 3D-CNNs and the soft attention mechanism, we propose the Attention-Enhanced I3D network, dubbed AE-I3D, to enhance the spatiotemporal representation for action recognition
Unlike the residual attention network for image classification [20], we extend residual attention from 2D images to 3D videos to enhance the spatiotemporal representation
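
The idea of a softmax attention map over the whole spatiotemporal volume, combined with a residual connection, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' AE-Res module: it uses a single-channel toy clip in plain Python, the function names are hypothetical, and the combination rule `features * (1 + mask)` is the form commonly used in 2D residual attention networks; whether AE-Res uses exactly this rule is an assumption. The real module would compute the scores with learned 3D convolutions.

```python
import math


def spatiotemporal_softmax(scores):
    """Softmax taken jointly over every (t, h, w) position of a T x H x W volume."""
    flat = [s for frame in scores for row in frame for s in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [[[math.exp(s - peak) for s in row] for row in frame] for frame in scores]
    total = sum(e for frame in exps for row in frame for e in row)
    return [[[e / total for e in row] for row in frame] for frame in exps]


def attention_enhanced_residual(features, scores):
    """Two-branch combination: trunk features scaled by (1 + attention mask).

    ASSUMPTION: the (1 + mask) rule is borrowed from 2D residual attention;
    the paper's exact formulation is not given in the abstract.
    """
    mask = spatiotemporal_softmax(scores)
    return [
        [[f * (1.0 + m) for f, m in zip(f_row, m_row)]
         for f_row, m_row in zip(f_frame, m_frame)]
        for f_frame, m_frame in zip(features, mask)
    ]


# Toy clip: 2 frames of 2 x 2 "features", with hand-picked attention scores.
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
scores = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 3.0]]]

mask = spatiotemporal_softmax(scores)
enhanced = attention_enhanced_residual(features, scores)
```

Because the mask is added to an identity term, a position with near-zero attention keeps its original feature value instead of being suppressed, which is the usual motivation for the residual form.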

Summary

Introduction

Action recognition is a challenging task in video understanding research. A growing number of algorithms and techniques [1]–[12] have been developed to recognize human actions in trimmed videos, where each video contains a single action and all clips have a standard duration. Two-stream [13] and 3D-CNN [2], [3] are the two main architectures for this task. The release of new large-scale datasets, such as Kinetics [14], has made action recognition increasingly challenging, and new methods with better performance have been developed. I3D (Inflated-3D) [9] has proven to be an effective 3D-CNN architecture, which inflates a 2D-CNN model by expanding 2D filters and

Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 8
License type: CC BY 4.0

Similar Papers

Decoding micro-electrocorticographic signals by using explainable 3D convolutional neural network to predict finger movements
Chao-Hung Kuo ... Jeffrey G Ojemann
Journal of Neuroscience Methods | VOL. 411 | 01 Aug 2024

Classification of schizophrenia and normal controls using 3D convolutional neural network and outcome visualization
Kanghan Oh ... Young Chul Chung
Schizophrenia Research | VOL. 212 | 06 Aug 2019

Inter-Dimensional Correlations Aggregated Attention Network for Action Recognition
Xiaochao Li ... Man Yang
IEEE Access | VOL. 9 | 01 Jan 2020

FMRI volume classification using a 3D convolutional neural network robust to shifted and scaled neuronal activations
Hanh Vu ... Jong-Hwan Lee
NeuroImage | VOL. 223 | 05 Sep 2020
