Abstract

Being simple and portable, the three-dimensional (3D) convolutional network has achieved great success in action recognition. However, its applicability to spatiotemporal feature learning is not evident. This study aims to improve the 3D convolution model by proposing flexible and effective attention modules for extracting spatiotemporal information. Our first contribution is a pair of modules, self-additive attention and feature-based attention, which provide a simple yet effective means of measuring the spatiotemporal importance of a video. In self-additive attention, the spatiotemporal fusion between frames is defined intuitively: equal weights are assigned to the video frames manually. Feature-based attention, in contrast, is trained adaptively through the 3D convolution process and combines the spatiotemporal information in the feature map. This study also addresses attention fusion for learning spatiotemporal characteristics in 3D convolution. When applied to the UCF101 and HMDB51 datasets, the proposed attention fusion method outperforms recently developed attention modules and the latest 3D networks. The experiments show consistent improvements, affirming the robustness of the method in extracting spatiotemporal attention.
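The abstract only outlines the two attention variants and their fusion, so the following is a minimal PyTorch-style sketch of how such modules might look. All names (SelfAdditiveAttention, FeatureBasedAttention, AttentionFusion), the 1x1x1 convolution with sigmoid gating, the element-wise sum used for fusion, and the tensor shapes are assumptions for illustration, not the paper's actual formulation.

```python
# A minimal sketch of the two attention variants described in the abstract.
# Module names, layer choices, shapes, and the fusion-by-sum step are all
# assumptions for illustration; the abstract gives no exact formulation.
import torch
import torch.nn as nn

class SelfAdditiveAttention(nn.Module):
    """Fixed, manually set equal weights: every frame contributes uniformly."""
    def forward(self, x):                      # x: (N, C, T, H, W)
        t = x.size(2)
        weights = torch.full((1, 1, t, 1, 1), 1.0 / t, device=x.device)
        return x * weights                     # uniform per-frame weighting

class FeatureBasedAttention(nn.Module):
    """Importance map learned adaptively from the 3D feature map."""
    def __init__(self, channels):
        super().__init__()
        # Hypothetical 1x1x1 conv producing one spatiotemporal attention map.
        self.conv = nn.Conv3d(channels, 1, kernel_size=1)
    def forward(self, x):                      # x: (N, C, T, H, W)
        attn = torch.sigmoid(self.conv(x))     # (N, 1, T, H, W) in [0, 1]
        return x * attn

class AttentionFusion(nn.Module):
    """Combine both branches; element-wise sum is an assumed fusion rule."""
    def __init__(self, channels):
        super().__init__()
        self.self_additive = SelfAdditiveAttention()
        self.feature_based = FeatureBasedAttention(channels)
    def forward(self, x):
        return self.self_additive(x) + self.feature_based(x)

# Usage on a dummy clip: batch of 2, 64 channels, 8 frames, 28x28 spatial.
fusion = AttentionFusion(channels=64)
clip = torch.randn(2, 64, 8, 28, 28)
out = fusion(clip)                             # same shape as the input
```

An element-wise sum keeps both branches on the same scale, but the paper may well use a learned or weighted combination for the fusion step instead.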
