Abstract

Skeleton-based action recognition has advanced significantly in the past decade. Among deep learning-based action recognition methods, one of the most commonly used structures is the two-stream network. This type of network extracts high-level spatial and temporal features from skeleton coordinates and optical flows, respectively. However, other features, such as the structure of the skeleton or the relations of specific joint pairs, are sometimes ignored, even though using them can also improve action recognition performance. To robustly learn more low-level skeleton features, this paper introduces an efficient fully convolutional network that processes multiple input features. The network has multiple streams, each with the same encoder-decoder structure. A temporal convolutional network and a co-occurrence convolutional network encode the local and global features, and a convolutional classifier decodes the high-level features to classify the action. Moreover, a novel fusion strategy is proposed to combine independent feature learning and dependent feature relating. Detailed ablation studies confirm the network's robustness to all feature inputs. When more features are combined and the number of streams increases, performance improves further. The proposed network is evaluated on three skeleton datasets: NTU RGB+D, Kinetics, and UTKinect. The experimental results show its effectiveness and superior performance over state-of-the-art methods.
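As a rough illustration of what "multiple input features" can mean for skeleton data, the sketch below derives three candidate input streams from one joint-coordinate sequence: the raw joints, bone vectors between joint pairs, and frame-to-frame motion. It is not the paper's implementation. The function names, the toy three-joint skeleton, and the `PAIRS` list are all hypothetical, and the actual features and joint pairs used by the network are not specified in this excerpt.

```python
from typing import Dict, List, Sequence, Tuple

Joint = Tuple[float, ...]   # (x, y, z) coordinates of one joint
Frame = List[Joint]         # all joints in one frame


def bone_vectors(frame: Frame, pairs: Sequence[Tuple[int, int]]) -> Frame:
    """For each (a, b) joint pair, return the vector pointing from joint a to joint b."""
    return [tuple(frame[b][k] - frame[a][k] for k in range(3)) for a, b in pairs]


def motion(frames: List[Frame]) -> List[Frame]:
    """Displacement of every joint between consecutive frames (T - 1 frames)."""
    return [
        [tuple(cur[j][k] - prev[j][k] for k in range(3)) for j in range(len(cur))]
        for prev, cur in zip(frames, frames[1:])
    ]


def build_streams(frames: List[Frame], pairs: Sequence[Tuple[int, int]]) -> Dict[str, list]:
    """Collect one input per stream; each would feed its own encoder-decoder."""
    return {
        "joints": frames,
        "bones": [bone_vectors(f, pairs) for f in frames],
        "motion": motion(frames),
    }


# Hypothetical toy skeleton: 3 joints in a chain, 2 frames.
PAIRS = [(0, 1), (1, 2)]
frames = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 2.0, 0.0)],
]
streams = build_streams(frames, PAIRS)
```

Each entry in `streams` corresponds to one possible input stream; adding another feature would amount to adding another key, which mirrors the abstract's point that the number of streams can grow with the number of features.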

Highlights

  • Human action recognition is one of the most challenging tasks in the field of computer vision and video understanding

  • To move beyond the limitation of input features and further improve the performance of skeleton-based action recognition, we propose a robust multi-feature network (MF-Net)

  • Experimental results: we evaluate the performance of our MF-Net


Summary

Introduction

Human action recognition is one of the most challenging tasks in computer vision and video understanding. It has developed rapidly and has been widely applied to human-computer interaction, visual surveillance, video indexing, virtual reality, and other areas. Earlier studies focused on RGB videos because such data is convenient to capture. The appearance of large-scale 3D skeleton datasets has since drawn increasing attention to skeleton-based action recognition. Besides depth sensing, skeleton data can also be obtained from video data using pose estimation algorithms [3], [4]. Compared to video-based models, skeleton-based models have several merits: they are robust to body scale, motion speed, and viewpoint variations [5], [6].

