Abstract

Conventional approaches to video action recognition learn feature maps with 3D convolutional neural networks (CNNs), exploiting the representational power of 3D CNNs by training on large-scale video datasets. However, action recognition remains a challenging task: because previous methods rarely distinguish the human body from its environment, they often overfit to background scenes. Note that separating the human body from the background allows distinct representations of human action to be learned. This paper proposes a novel attention module that focuses on the action part(s) of a frame while neglecting non-action part(s) such as the background. First, the attention module employs a triplet loss to differentiate active features from non-active or less active ones. Second, two attention modules, operating in the spatial and channel domains, are proposed to enhance the feature representation ability for action recognition: the spatial attention module learns the spatial correlation of features, and the channel attention module learns their channel correlation. Experimental results show that the proposed method achieves state-of-the-art performance of 41.41% and 55.21% on the Diving48 and Something-V1 datasets, respectively. In addition, the proposed method is competitive on UCF-101 and HMDB-51, reaching 95.83% and 74.33%, respectively.
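The triplet loss mentioned above pulls features of action regions together while pushing away background features. A minimal sketch of the standard hinge-style triplet loss follows; the feature vectors and margin value here are illustrative toys, not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the anchor toward the positive
    (active) feature and push it away from the negative (background)
    feature by at least `margin` in Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy features: the anchor is already much closer to the positive than
# to the negative, so the margin is satisfied and the loss is zero.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.0
```

When the anchor is nearer the negative than the positive, the loss grows linearly, which is what drives active and non-active features apart during training.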

Highlights

  • Action recognition, one of the crucial tasks of video-based computer vision, is becoming popular in various applications such as media analysis, robotics, and video surveillance

  • Discriminative learning based on triplet loss is not used here because the objective of the triplet loss, which is based on spatial features, differs from that of the channel attention module (CAM), which is based on channel relationships

  • We propose a double attention (DA) module that generates an attention map in consideration of spatio-temporal information and enables triplet loss-based discriminative learning


Summary

INTRODUCTION

Action recognition, one of the crucial tasks of video-based computer vision, is becoming popular in various applications such as media analysis, robotics, and video surveillance. This paper proposes an attention module that produces independent features for the background and the action area, and presents a learning method that discriminates the features of the generated attention maps. Since video-based action recognition is the main task, an attention (feature) map is generated by considering spatial information as well as channel information [10,11].

A. Overall Approach

Section I qualitatively demonstrated that discriminative learning of attention maps is beneficial for action recognition. Based on this fact, we propose to create attention maps that fully exploit spatio-temporal information, and define geometric similarity relationships for their discriminative learning.
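The spatial and channel attention idea can be sketched as two gating steps over a (C, H, W) feature map: channel attention re-weights whole channels, spatial attention re-weights locations. This is a rough NumPy stand-in under simplifying assumptions (sigmoid gates over plain means), not the paper's actual DA module, which additionally models correlations across the spatio-temporal volume.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Channel gating: squeeze spatial dims by global average pooling,
    then scale each channel by a sigmoid weight. `feat` is (C, H, W)."""
    weights = sigmoid(feat.mean(axis=(1, 2)))   # (C,)
    return feat * weights[:, None, None]

def spatial_attention(feat):
    """Spatial gating: collapse channels into one (H, W) map and scale
    every location, emphasizing action regions over background."""
    attn = sigmoid(feat.mean(axis=0))           # (H, W)
    return feat * attn[None, :, :]

def double_attention(feat):
    """Apply channel attention followed by spatial attention, a toy
    stand-in for a combined spatial/channel attention module."""
    return spatial_attention(channel_attention(feat))

feat = np.random.randn(8, 4, 4)                 # (C, H, W) feature map
out = double_attention(feat)
print(out.shape)                                # (8, 4, 4)
```

Because every gate lies in (0, 1), the output preserves the feature map's shape while attenuating background responses rather than zeroing them out.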

Spatial Attention Module
Channel Attention Module
Triplet loss for Attention Feature Learning
EXPERIMENTS
Quantitative Results
Further Analysis using Activation Map Visualization
Findings
CONCLUSION