Abstract

Being simple and portable, the three-dimensional (3D) convolutional network has achieved great success in action recognition. However, its applicability to spatiotemporal feature learning is not evident. This study aims to improve the 3D convolution model by proposing flexible and effective attention modules for extracting spatiotemporal information. Our first contribution is a pair of modules, self-additive attention and feature-based attention, which provide a simple yet effective means of measuring the spatiotemporal importance of a video. In self-additive attention, the spatiotemporal fusion between frames is defined intuitively: equal weights are assigned to the video frames manually. Feature-based attention, in contrast, is trained adaptively through the 3D convolution process and combines the spatiotemporal information in the feature map. This study also addresses attention fusion for learning spatiotemporal characteristics in 3D convolution. When applied to the UCF101 and HMDB51 datasets, the proposed attention fusion method outperforms recently developed attention modules and the latest 3D networks. The experiments show consistent improvements, affirming the robustness of the method in extracting spatiotemporal attention.
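The abstract only outlines the two attention variants and their fusion, so the following is a minimal PyTorch-style sketch of how such modules might look. All names (SelfAdditiveAttention, FeatureBasedAttention, AttentionFusion), the 1x1x1 convolution with sigmoid gating, the element-wise sum used for fusion, and the tensor shapes are assumptions for illustration, not the paper's actual formulation.

```python
# A minimal sketch of the two attention variants described in the abstract.
# Module names, layer choices, shapes, and the fusion-by-sum step are all
# assumptions for illustration; the abstract gives no exact formulation.
import torch
import torch.nn as nn

class SelfAdditiveAttention(nn.Module):
    """Fixed, manually set equal weights: every frame contributes uniformly."""
    def forward(self, x):                      # x: (N, C, T, H, W)
        t = x.size(2)
        weights = torch.full((1, 1, t, 1, 1), 1.0 / t, device=x.device)
        return x * weights                     # uniform per-frame weighting

class FeatureBasedAttention(nn.Module):
    """Importance map learned adaptively from the 3D feature map."""
    def __init__(self, channels):
        super().__init__()
        # Hypothetical 1x1x1 conv producing one spatiotemporal attention map.
        self.conv = nn.Conv3d(channels, 1, kernel_size=1)
    def forward(self, x):                      # x: (N, C, T, H, W)
        attn = torch.sigmoid(self.conv(x))     # (N, 1, T, H, W) in [0, 1]
        return x * attn

class AttentionFusion(nn.Module):
    """Combine both branches; element-wise sum is an assumed fusion rule."""
    def __init__(self, channels):
        super().__init__()
        self.self_additive = SelfAdditiveAttention()
        self.feature_based = FeatureBasedAttention(channels)
    def forward(self, x):
        return self.self_additive(x) + self.feature_based(x)

# Usage on a dummy clip: batch of 2, 64 channels, 8 frames, 28x28 spatial.
fusion = AttentionFusion(channels=64)
clip = torch.randn(2, 64, 8, 28, 28)
out = fusion(clip)                             # same shape as the input
```

An element-wise sum keeps both branches on the same scale, but the paper may well use a learned or weighted combination for the fusion step instead.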
