Abstract
Inspired by the promising performance achieved by recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in skeleton-based action recognition, this paper presents a deep network architecture that combines a CNN for classification with an RNN-based attention mechanism for human interaction recognition. Specifically, the attention module assigns different levels of attention to different frames through learned weights, while the CNN extracts high-level spatial and temporal information from the skeleton data. These two modules seamlessly form a single network architecture. In addition, to eliminate the impact of different body locations and orientations, the skeleton coordinates are transformed from the original coordinate system to a human-centric coordinate system. Furthermore, three different features are extracted from the skeleton data and fed to three subnetworks, respectively; these subnetworks are then fused into an integrated network. Experimental results demonstrate the validity of the proposed approach on two widely used human interaction datasets.
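As a rough illustration of the frame-level attention described above, the sketch below weights per-frame features by normalized relevance scores so that more informative frames dominate the sequence summary. This is a minimal NumPy sketch only: the function names, the scoring vector `w`, and the simple dot-product score are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frame_features, w):
    """Weight per-frame features by attention scores.

    frame_features: (T, D) array, one D-dim feature per frame
        (e.g., a recurrent hidden state per skeleton frame).
    w: (D,) scoring vector; in a real model this would be learned.
    Returns a (D,) attention-weighted summary of the sequence.
    """
    scores = frame_features @ w      # (T,) one relevance score per frame
    alpha = softmax(scores)          # weights sum to 1 across frames
    return alpha @ frame_features    # weighted combination of frames

# Toy usage with stand-in data (shapes chosen arbitrarily):
T, D = 30, 64
feats = np.random.randn(T, D)        # stand-in for per-frame RNN states
w = np.random.randn(D)               # stand-in for a learned scoring vector
summary = attention_pool(feats, w)   # heavily-weighted frames dominate
```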
Highlights
Human action recognition and interaction recognition have recently attracted intensive attention from researchers in the computer vision field due to their extensive application prospects, such as intelligent surveillance and human-machine interaction
The proposed approach is evaluated on two human interaction datasets captured by Kinect: the SBU Interaction Dataset [9] and the NTU RGB+D Dataset [10]
The human-centric coordinate system eliminates the influence of different viewing perspectives on actions, which verifies the effectiveness of the coordinate transformation (see the sketch below)
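A minimal sketch of such a human-centric normalization follows, assuming (hypothetically) that a hip-center joint serves as the origin and the shoulder line defines the x-axis; the joint indices and reference joints are illustrative, since the exact skeleton layout depends on the dataset.

```python
import numpy as np

def to_human_centric(skeleton, hip=0, l_shoulder=4, r_shoulder=8):
    """Translate and rotate one skeleton frame into a human-centric frame.

    skeleton: (J, 3) array of joint coordinates in camera space.
    hip, l_shoulder, r_shoulder: hypothetical joint indices; the actual
    indices depend on the Kinect skeleton layout of the dataset used.
    """
    # Translation: place the hip-center joint at the origin,
    # removing the effect of where the subject stands.
    centered = skeleton - skeleton[hip]

    # Rotation about the vertical (y) axis so the shoulder line becomes
    # parallel to the x-axis, removing view-orientation variance.
    d = centered[r_shoulder] - centered[l_shoulder]
    theta = np.arctan2(d[2], d[0])   # shoulder-line angle in the x-z plane
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[  c, 0.0,   s],
                      [0.0, 1.0, 0.0],
                      [ -s, 0.0,   c]])
    return centered @ rot_y.T

# For a sequence, the same transform would typically be applied per frame
# (or fixed from the first frame) before feature extraction.
```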
Summary
Human action recognition and interaction recognition have recently attracted intensive attention from researchers in the computer vision field due to their extensive application prospects, such as intelligent surveillance and human-machine interaction. Most previous methods are devoted to human action recognition in two-dimensional RGB data [1,2]. Because RGB data are highly sensitive to environmental variability, precise action recognition is a challenging task. In [3], Rezazadegan et al. proposed an action-region proposal method that, informed by optical flow, extracts image regions likely to contain actions and thus eliminates the influence of the background. The sensitivity to appearance and environment can also be overcome by using cost-efficient RGB-D (i.e., color plus depth) sensors [5]