Abstract

Conventional approaches to video action recognition learn feature maps with 3D convolutional neural networks (CNNs), exploiting the representational power of 3D CNNs by training on large-scale video datasets. However, action recognition remains a challenging task: because previous methods rarely distinguish the human body from its environment, they often overfit to background scenes. Note that separating the human body from the background allows distinct representations of human action to be learned. This paper proposes a novel attention module that focuses on the action part(s) of a frame while neglecting non-action part(s) such as the background. First, the attention module employs a triplet loss to differentiate active features from non-active or less active ones. Second, two attention modules, operating in the spatial and channel domains, are proposed to enhance the feature representation ability for action recognition: the spatial attention module learns the spatial correlation of features, and the channel attention module learns their channel correlation. Experimental results show that the proposed method achieves state-of-the-art performance of 41.41% and 55.21% on the Diving48 and Something-V1 datasets, respectively. In addition, the proposed method is competitive on UCF-101 and HMDB-51, reaching 95.83% and 74.33%, respectively.
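The triplet loss mentioned above pulls features of action regions together while pushing away background features. A minimal sketch of the standard hinge-style triplet loss follows; the feature vectors and margin value here are illustrative toys, not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the anchor toward the positive
    (active) feature and push it away from the negative (background)
    feature by at least `margin` in Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy features: the anchor is already much closer to the positive than
# to the negative, so the margin is satisfied and the loss is zero.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.0
```

When the anchor is nearer the negative than the positive, the loss grows linearly, which is what drives active and non-active features apart during training.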

Highlights

  • Action recognition, one of the crucial tasks of video-based computer vision, is becoming popular in various applications such as media analysis, robotics, and video surveillance

  • Discriminative learning based on triplet loss is not used here because the objective of the triplet loss, which is based on spatial features, differs from that of the channel attention module (CAM), which is based on channel relationships

  • We propose a double attention (DA) module that generates an attention map in consideration of spatio-temporal information and enables triplet loss-based discriminative learning


Summary

INTRODUCTION

Action recognition, one of the crucial tasks of video-based computer vision, is becoming popular in various applications such as media analysis, robotics, and video surveillance. This paper proposes an attention module that produces independent features for the background and the action area, and presents a learning method that discriminates the features of the generated attention maps. Since video-based action recognition is the main task, an attention (feature) map is generated by considering spatial information as well as channel information [10,11].

A. Overall Approach

Section I qualitatively demonstrated that discriminative learning of attention maps is beneficial for action recognition. Based on this fact, we propose to create attention maps that fully exploit spatio-temporal information, and define geometric similarity relationships for their discriminative learning.
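The spatial and channel attention idea can be sketched as two gating steps over a (C, H, W) feature map: channel attention re-weights whole channels, spatial attention re-weights locations. This is a rough NumPy stand-in under simplifying assumptions (sigmoid gates over plain means), not the paper's actual DA module, which additionally models correlations across the spatio-temporal volume.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Channel gating: squeeze spatial dims by global average pooling,
    then scale each channel by a sigmoid weight. `feat` is (C, H, W)."""
    weights = sigmoid(feat.mean(axis=(1, 2)))   # (C,)
    return feat * weights[:, None, None]

def spatial_attention(feat):
    """Spatial gating: collapse channels into one (H, W) map and scale
    every location, emphasizing action regions over background."""
    attn = sigmoid(feat.mean(axis=0))           # (H, W)
    return feat * attn[None, :, :]

def double_attention(feat):
    """Apply channel attention followed by spatial attention, a toy
    stand-in for a combined spatial/channel attention module."""
    return spatial_attention(channel_attention(feat))

feat = np.random.randn(8, 4, 4)                 # (C, H, W) feature map
out = double_attention(feat)
print(out.shape)                                # (8, 4, 4)
```

Because every gate lies in (0, 1), the output preserves the feature map's shape while attenuating background responses rather than zeroing them out.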

Spatial Attention Module
Channel Attention Module
Triplet loss for Attention Feature Learning
EXPERIMENTS
Quantitative Results
Further Analysis using Activation Map Visualization
Findings
CONCLUSION