An End to End Framework With Adaptive Spatio-Temporal Attention Module for Human Action Recognition

Yibin Li, Xin Ma, Shaocan Liu, Hanbo Wu

doi:10.1109/access.2020.2979549

Abstract

Human action recognition is a challenging task in computer vision, and modeling the spatio-temporal information in videos effectively is crucial for improving recognition performance. In this paper, we introduce an end-to-end framework, Spatio-Temporal Attention ConvNet (STACNet), which combines convolutional neural networks with two novel attention modules: a Spatial Attention Module (SAM) and a Temporal Attention Module (TAM). SAM is built by fusing the value feature and the gradient feature of the feature map, so that the ConvNet representation focuses on the informative motion regions of an action. TAM is built by combining global average pooling and global max pooling to identify key frames in a video. With the two attention modules, STACNet can adaptively distinguish key frames in a sequence and selectively pay different levels of attention to different spatial motion regions of human actions, at a virtually negligible increase in computational cost. We demonstrate the effectiveness of SAM and TAM for action recognition separately. The experimental results show that STACNet obtains superior performance on the HMDB51 and UCF101 datasets.
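As an illustration of the temporal attention idea described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. The abstract states only that TAM combines global average pooling and global max pooling to find key frames; the function name `temporal_attention`, the fusion by simple addition, and the softmax normalization over frames are all assumptions made for this example.

```python
import numpy as np


def temporal_attention(features):
    """Hypothetical frame-level attention over per-frame feature maps.

    features: array of shape (T, C, H, W), one feature map per frame.
    Returns (weights, reweighted): weights has shape (T,) and sums to 1,
    reweighted has the same shape as features.
    """
    # Global average pooling and global max pooling over channels and
    # spatial positions give one scalar descriptor per frame each.
    avg_desc = features.mean(axis=(1, 2, 3))
    max_desc = features.max(axis=(1, 2, 3))

    # Assumption: the two descriptors are fused by addition. The paper
    # may use a learned fusion instead.
    scores = avg_desc + max_desc

    # Assumption: softmax over the time axis (shifted for numerical stability).
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()

    # Scale every frame's feature map by its attention weight.
    reweighted = features * weights[:, None, None, None]
    return weights, reweighted


# Usage: three frames, where the middle frame has the strongest activations.
frames = np.zeros((3, 2, 4, 4))
frames[1] += 1.0
weights, reweighted = temporal_attention(frames)
```

In this toy input the middle frame receives the largest weight, which mirrors the stated goal of distinguishing key frames in a sequence. A trained module would normally learn how to map the pooled descriptors to weights rather than use them directly.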

Highlights

Human action recognition is a fundamental and important task in computer vision, owing to its applications in many areas, including video content analysis, video surveillance, and human-computer interaction [1].
We introduce an end-to-end framework for action recognition, the Spatio-Temporal Attention ConvNet (STACNet), which embeds two novel attention modules into state-of-the-art convolutional neural network (ConvNet) architectures.
The section "Action Recognition with Spatio-Temporal Attention ConvNet" introduces the details of performing action recognition with STACNet.

Summary

Introduction

Human action recognition is a fundamental and important task in computer vision, owing to its applications in many areas, including video content analysis, video surveillance, and human-computer interaction [1]. Although remarkable progress has been made in this field in recent years, it remains a challenging problem because of complex backgrounds, intra-class variations, low resolution, and high dimensionality [2]. Owing to its powerful ability to represent image features, the convolutional neural network (ConvNet) has been widely used in image classification [3], [4], target detection [5], and image segmentation [6], and it is undeniably a powerful tool for action recognition.

Objectives

Methods

Discussion

Conclusion