Abstract

The same action can take a different amount of time in different instances, and this variation affects the accuracy of action recognition to a certain extent. We propose an end-to-end deep neural network called “Multi-Term Attention Networks” (MTANs), which addresses this problem by extracting temporal features at different time scales. The network consists of a Multi-Term Attention Recurrent Neural Network (MTA-RNN) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). In MTA-RNN, a method for fusing multi-term temporal features is proposed to extract temporal dependencies at different time scales, and the weighted fused temporal feature is recalibrated by the attention mechanism. Ablation studies show that this network has powerful spatio-temporal dynamic modeling capabilities for actions at different time scales. We perform extensive experiments on four challenging benchmark datasets: the NTU RGB+D dataset, the UT-Kinect dataset, the Northwestern-UCLA dataset, and the UWA3DII dataset. Our method achieves better results than state-of-the-art methods, which demonstrates the effectiveness of MTANs.
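To make the two-branch layout concrete, here is a minimal PyTorch sketch of an MTAN-style network as the abstract describes it. This is an illustration only: the layer choices, the default dimensions (e.g. 25 joints as in NTU RGB+D), and the score-level fusion of the two branches are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MTANSketch(nn.Module):
    """Hypothetical two-branch layout: an RNN branch for temporal features
    plus a CNN branch over the spatio-temporal skeleton map."""
    def __init__(self, num_joints=25, coords=3, num_classes=60, hidden=128):
        super().__init__()
        in_dim = num_joints * coords
        # MTA-RNN branch (stand-in: a single LSTM; the real branch is multi-term)
        self.mta_rnn = nn.LSTM(in_dim, hidden, batch_first=True)
        self.rnn_head = nn.Linear(hidden, num_classes)
        # ST-CNN branch: treat the sequence as a (coords, frames, joints) image
        self.st_cnn = nn.Sequential(
            nn.Conv2d(coords, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):               # x: (batch, frames, joints, coords)
        b, t, j, c = x.shape
        rnn_out, _ = self.mta_rnn(x.reshape(b, t, j * c))
        rnn_scores = self.rnn_head(rnn_out[:, -1])
        cnn_scores = self.st_cnn(x.permute(0, 3, 1, 2))
        return rnn_scores + cnn_scores  # late score fusion (assumed)
```

A dummy forward pass such as `MTANSketch()(torch.randn(2, 64, 25, 3))` yields class scores of shape (2, 60).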

Highlights

  • Human Action Recognition (HAR) has attracted considerable attention from the computer vision research community in recent years

  • The problem is that general action recognition methods can only extract single-term temporal features, so their ability to model the spatio-temporal dynamics of actions at different time scales is limited

  • For the input skeleton sequences, a multi-LSTM module based on temporal sliding is used to capture temporal feature information from input action sequences at different terms, and the temporal feature is recalibrated by the Attention Recalibration Module (ARM); a hedged sketch of such a module follows below
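The last bullet mentions a multi-LSTM module over temporal slides plus an Attention Recalibration Module (ARM). The sketch below shows one way such a module could be wired in PyTorch; the window lengths, the stride, the mean pooling over windows, and the SE-style gate are illustrative assumptions on our part, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiTermSlidingLSTM(nn.Module):
    """One LSTM per 'term' (window length); each runs over sliding windows."""
    def __init__(self, in_dim, hidden, windows=(8, 16, 32)):
        super().__init__()
        self.windows = windows
        self.lstms = nn.ModuleList(
            nn.LSTM(in_dim, hidden, batch_first=True) for _ in windows)

    def forward(self, x):               # x: (batch, frames, in_dim)
        feats = []
        for w, lstm in zip(self.windows, self.lstms):
            # slice the sequence into overlapping windows of length w
            chunks = x.unfold(1, w, w // 2)          # (b, n_win, in_dim, w)
            b, n, d, _ = chunks.shape
            out, _ = lstm(chunks.permute(0, 1, 3, 2).reshape(b * n, w, d))
            # last hidden state per window, averaged over windows
            feats.append(out[:, -1].reshape(b, n, -1).mean(dim=1))
        return feats                    # one feature vector per term

class AttentionRecalibration(nn.Module):
    """Stand-in for the ARM: a squeeze-and-excitation-style channel gate."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, f):               # f: (batch, dim)
        return f * self.gate(f)
```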

Summary

Introduction

Human Action Recognition (HAR) has attracted considerable attention from the computer vision research community in recent years. The problem is that general action recognition methods can only extract single-term temporal features, so their ability to model the spatio-temporal dynamics of actions at different time scales is limited. The Multi-Term Temporal Sliding LSTM (MT-TS-LSTM) is introduced in MTA-RNN to extract features at different time scales. We propose general Multi-Term Attention Networks (MTANs) for skeleton-based action recognition. MTANs introduce a method for fusing multi-term temporal features to address the recognition of actions with large time-scale differences; one plausible realisation of this fusion is sketched below. Networks with our strategy are able to reinforce temporal features for classifying actions.
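Reading “fusing multi-term temporal features” as a learnable weighted combination of the per-term features followed by attention recalibration, a minimal sketch could look as follows. The softmax term weights and the sigmoid gate are assumptions on our part, not the paper's stated formulation.

```python
import torch
import torch.nn as nn

class WeightedTermFusion(nn.Module):
    """Fuse per-term features with learnable softmax weights, then recalibrate.
    Assumes every term feature shares the same dimension."""
    def __init__(self, num_terms, dim):
        super().__init__()
        self.term_logits = nn.Parameter(torch.zeros(num_terms))
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats):           # feats: list of (batch, dim) tensors
        stacked = torch.stack(feats)                 # (terms, batch, dim)
        alpha = torch.softmax(self.term_logits, 0)   # normalised term weights
        fused = (alpha.view(-1, 1, 1) * stacked).sum(dim=0)
        return fused * self.gate(fused)              # attention recalibration
```

Given the outputs of `MultiTermSlidingLSTM` above, `WeightedTermFusion(num_terms=3, dim=hidden)` would produce a single fused feature vector per sequence, ready for a classification head.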

Research Significance
Related Work
Methods
Multi-Term Attention Recurrent Neural Network
Multi-Term Temporal Sliding LSTM
Attention Recalibration Based on Fusion Features
Spatio-Temporal Convolution Neural Network
Experiments and Analysis
Datasets
UT-Kinect Action Dataset
Northwestern-UCLA Dataset
UWA3DII Dataset
Experiment Design
MTA-RNN and ST-CNN
Effectiveness of Each Module in MTA-RNN
Feature Concatenation and Weighted Feature Fusion
Combined Model and MTANs
Experimental Results
UT-Kinect Action Dataset Results
Northwestern-UCLA Dataset Results
UWA3DII Dataset Results
Analysis of Results
Conclusions
