Abstract

Action recognition plays an important role in various applications such as video monitoring, automatic video indexing, crowd analysis, human-machine interaction, smart homes and personal assistive robotics. In this paper, we propose improvements to several methods for human action recognition from videos that work with data represented in the form of skeleton poses. These methods are based on the techniques most widely used for this problem: Graph Convolutional Networks (GCNs), Temporal Convolutional Networks (TCNs) and Recurrent Neural Networks (RNNs). The paper first explores and compares different ways to extract the most relevant spatial and temporal characteristics from a sequence of frames describing an action. Based on this comparative analysis, we show how a TCN-type unit can be extended to also operate on the characteristics extracted from the spatial domain. To validate our approach, we test it on a benchmark commonly used for human action recognition and show that our solution obtains results comparable to the state of the art, with a significant increase in inference speed.
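As background for the comparison described above, the following is a minimal sketch, assuming a PyTorch implementation, a normalized skeleton adjacency matrix, and input tensors shaped (batch, channels, frames, joints), of the generic GCN-plus-TCN block that skeleton-based methods of this kind typically build on. It illustrates the common pattern only, not the exact architectures evaluated in the paper; all names and shapes are illustrative.

import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    """One generic block: graph convolution over joints, then convolution over time."""

    def __init__(self, in_channels, out_channels, adjacency, temporal_kernel=9):
        super().__init__()
        # Normalized adjacency matrix of the skeleton graph, shape (V, V).
        self.register_buffer("A", adjacency)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = self.spatial(x)                                   # mix channels per joint
        x = torch.einsum("nctv,vw->nctw", x, self.A)          # aggregate neighbouring joints
        x = self.relu(x)
        return self.relu(self.temporal(x))                    # convolve along the time axis


# Example usage with a 25-joint skeleton (e.g., NTU RGB+D):
# A = torch.eye(25)  # placeholder for a real normalized adjacency matrix
# block = SpatialTemporalBlock(3, 64, A)
# out = block(torch.randn(8, 3, 300, 25))  # -> (8, 64, 300, 25)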

Highlights

  • The problem of recognizing people’s actions is very complex because it depends on many factors

  • We considered two types of architectures, Recurrent Neural Networks (RNNs) and Temporal Convolutional Networks (TCNs), as they are among the most widely used architectures for human action recognition from skeletal data

  • Residual Graph Convolutional Network (ResGCN)-TCN (v1): for this model, we used the linear rearrangement of the joints, with a temporal window of size 9 and a spatial window of size 3 for all TCN blocks (a sketch of such a block follows this list)
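The sketch below, in PyTorch, illustrates what a TCN unit extended to the spatial axis could look like under the parameters listed above: after the joints are placed in a fixed linear order, a single 2D convolution covers a 9-frame temporal window and a 3-joint spatial window. The class name, layer layout and residual connection are assumptions made for illustration, not the paper's exact code.

import torch.nn as nn


class ExtendedTCNUnit(nn.Module):
    """Hypothetical TCN block whose kernel also spans neighbouring joints."""

    def __init__(self, channels, temporal_window=9, spatial_window=3):
        super().__init__()
        self.conv = nn.Conv2d(
            channels, channels,
            kernel_size=(temporal_window, spatial_window),
            padding=((temporal_window - 1) // 2, (spatial_window - 1) // 2),
        )
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, joints), joints in the linear rearrangement
        return self.relu(self.bn(self.conv(x)) + x)  # residual connection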


Summary

Introduction

The problem of recognizing people’s actions is very complex because it depends on many factors. This subject became one of the most important research topics in the field of computer vision due to its wide applicability in practice. There are several ways in which the movements that define a human action can be recorded: as a video clip (a set of RGB images), as a series of depth maps, or as a data structure storing the positions of a set of joints for each time frame, either a time-dependent 3D mesh of the visible human body surface or simply a time-dependent graph of articulation points describing a simplified model of the human skeleton, as well as combinations of these. The NTU RGB+D dataset [2], containing samples describing actions and interactions, was used to train the models presented in this paper. The samples that describe interactions cover only interactions with another person.
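To make the skeleton representation concrete, the snippet below sketches how one sample from a dataset such as NTU RGB+D is often stored as an array. The 25 joints and up-to-two bodies follow the NTU RGB+D skeleton format; the 300-frame padding and the axis order are conventions borrowed from public preprocessing pipelines and are assumptions here, not details stated in this paper.

import numpy as np

NUM_COORDS = 3     # (x, y, z) position of each joint
NUM_FRAMES = 300   # assumed clip length after padding/truncation
NUM_JOINTS = 25    # joints per skeleton in NTU RGB+D
NUM_BODIES = 2     # up to two people, needed for the interaction classes

# One sample: coordinates x frames x joints x bodies
sample = np.zeros((NUM_COORDS, NUM_FRAMES, NUM_JOINTS, NUM_BODIES), dtype=np.float32)

# A single-person action fills only the first body slot;
# a two-person interaction fills both.
sample[:, :, :, 0] = np.random.randn(NUM_COORDS, NUM_FRAMES, NUM_JOINTS)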

