Abstract
Conventional approaches to video action recognition learn feature maps with 3D convolutional neural networks (CNNs), relying on the representation power of 3D CNNs trained on large-scale video datasets. However, action recognition remains a challenging task: because previous methods rarely distinguish the human body from the environment, they often overfit to background scenes. Separating the human body from the background allows the network to learn distinct representations of human action. This paper proposes a novel attention module that focuses only on the action part(s) of a frame while neglecting non-action part(s) such as the background. First, the attention module employs a triplet loss to differentiate active features from non-active or less active features. Second, two attention modules, operating in the spatial and channel domains, are proposed to enhance the feature representation for action recognition: the spatial attention module learns spatial correlations of features, and the channel attention module learns channel correlations. Experimental results show that the proposed method achieves state-of-the-art performance of 41.41% and 55.21% on the Diving48 and Something-V1 datasets, respectively. In addition, the proposed method provides competitive performance on the UCF-101 and HMDB-51 datasets, i.e., 95.83% on UCF-101 and 74.33% on HMDB-51.
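To make the two attention mechanisms concrete, the following is a minimal PyTorch-style sketch of a channel attention block (re-weighting channels from globally pooled statistics) and a spatial attention block (a per-location map that emphasizes action regions). The module names, reduction ratio, and kernel size are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Hedged sketch of channel and spatial attention blocks for 3D (video) features.
# All design choices below (reduction ratio, kernel size) are assumptions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights channels using globally pooled statistics (channel correlation)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (N, C, T, H, W)
        pooled = x.mean(dim=(2, 3, 4))             # global average pool -> (N, C)
        weights = torch.sigmoid(self.mlp(pooled))  # per-channel weights in [0, 1]
        return x * weights.view(x.size(0), -1, 1, 1, 1)


class SpatialAttention(nn.Module):
    """Produces a per-location map so action regions outweigh the background."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

    def forward(self, x):                                   # x: (N, C, T, H, W)
        avg_map = x.mean(dim=1, keepdim=True)               # (N, 1, T, H, W)
        max_map = x.amax(dim=1, keepdim=True)               # (N, 1, T, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn
```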
Highlights
Action recognition, one of the crucial tasks of video-based computer vision, is becoming popular in various applications such as media analysis, robotics, and video surveillance
Discriminative learning based on the triplet loss is not used here because the objective of the triplet loss, defined on spatial features, differs from that of the channel attention module (CAM), which models channel relationships
We propose a double attention (DA) module that generates an attention map by considering spatio-temporal information and enables triplet loss-based discriminative learning
Summary
Action recognition, one of the crucial tasks of video-based computer vision, is becoming popular in various applications such as media analysis, robotics, and video surveillance. This paper proposes an attention module that produces independent features for the background and the action area, and presents a learning method that discriminates between the features of the generated attention maps. Since video-based action recognition is the main task, an attention (feature) map is generated by considering spatial information as well as channel information [10,11].

A. Overall Approach

Section I qualitatively demonstrated that discriminative learning of attention maps is beneficial for action recognition. Based on this observation, we propose to create attention maps that fully exploit spatio-temporal information, and we define geometric similarity relationships among them for discriminative learning.
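As one way to picture this triplet-based discriminative learning, the sketch below pools features inside and outside an attention map and applies a standard triplet margin loss, treating attended (action) features as anchor/positive and the clip's own background features as negative. The pooling scheme, the pairing choice (positives taken from another clip in the batch), and the margin value are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a triplet-style objective that pushes action features away
# from background features. Pairing scheme and margin are illustrative only.
import torch
import torch.nn.functional as F


def attention_triplet_loss(features, attn_map, margin: float = 0.5):
    """features: (N, C, T, H, W) backbone features; attn_map: (N, 1, T, H, W) in [0, 1]."""
    # Average-pool features inside (action) and outside (background) the attended region.
    action = (features * attn_map).flatten(2).sum(-1) / attn_map.flatten(2).sum(-1).clamp(min=1e-6)
    background = (features * (1 - attn_map)).flatten(2).sum(-1) / (1 - attn_map).flatten(2).sum(-1).clamp(min=1e-6)

    # Anchor: each clip's action features; positive: action features of another
    # clip in the batch (one simple pairing choice); negative: own background features.
    anchor = F.normalize(action, dim=1)
    positive = F.normalize(action.roll(shifts=1, dims=0), dim=1)
    negative = F.normalize(background, dim=1)

    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)
```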