Abstract

Action recognition has achieved great progress in recent years thanks to better feature representation learning and classification techniques such as convolutional neural networks (CNNs). However, most current deep learning approaches treat action recognition as a black box, ignoring the domain knowledge specific to actions themselves. In this paper, by analyzing the characteristics of different actions, we propose a new framework that combines a residual-attention module with a joint path-signature feature (JPSF) representation. Path signature theory was developed recently in the field of rough paths and stochastic analysis, and provides a very efficient way to analyze any temporal sequence data. The proposed n-fold joint path signature features entail the Euclidean distances between joints and their respective angles. In our experiments, JPSFs for three modalities of joints (spatial location, bi-folds and tri-folds) are computed over the temporal length of the action sequence. All these PSFs are then concatenated and fed to a CNN to produce the recognition result. Experiments on three benchmark datasets, J-HMDB, HMDB-51 and UCF-101, indicate that our proposed method achieves state-of-the-art performance.

Highlights

  • Recognizing actions in videos is considered a very challenging task in computer vision

  • Great progress has been made in action recognition over the last decade due to convolutional neural networks (CNNs) [1] and recurrent neural networks (RNNs)

  • A temporal sequence, such as on-line text, can be represented as path signature features and fed into convolutional neural networks


Summary

INTRODUCTION

Recognizing actions in videos is considered a very challenging task in computer vision. A temporal sequence such as on-line text can be represented as path signature features and fed into convolutional neural networks. Using 3D-ConvNets [1], [10], videos are represented as spatio-temporal blobs and 3D convolution models are trained for action recognition. An extension of the two-stream network to an inflated 3D ConvNet is proposed in [4], expanding 3D convolutional networks to learn spatio-temporal features for video classification. In [29], the authors developed an attention-based neural network to model scene-object interactions for action recognition and video captioning. In [33], the authors show that path signatures can serve as input features for convolutional neural networks (CNNs), improving the accuracy of on-line character recognition. In [37], a PSF-based approach is employed for action recognition in videos.
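To make the idea of path signature features concrete, the following is a minimal sketch (not the authors' implementation, which uses n-fold joint paths and deeper truncation levels) of the first two signature levels of a piecewise-linear path, such as a joint trajectory over time. It accumulates the iterated integrals segment by segment using Chen's identity; the function name and trajectory are illustrative assumptions.

```python
import numpy as np

def path_signature_lvl2(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    path: (N, d) array of points (e.g. a joint trajectory over time).
    Returns (S1, S2): S1 is the d-vector of first-order increments,
    S2 the d x d matrix of second-order iterated integrals.
    """
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    S1 = np.zeros(d)
    S2 = np.zeros((d, d))
    for delta in np.diff(path, axis=0):
        # Chen's identity for appending one linear segment:
        # the segment's own level-2 term is 0.5 * outer(delta, delta)
        S2 += np.outer(S1, delta) + 0.5 * np.outer(delta, delta)
        S1 += delta
    return S1, S2

# Example: a short 2-D "joint trajectory"
traj = np.array([[0.0, 0.0], [1.0, 0.5], [1.5, 2.0], [0.5, 2.5]])
S1, S2 = path_signature_lvl2(traj)
# Flatten into a fixed-length feature vector suitable as CNN input
feature = np.concatenate([S1, S2.ravel()])
```

A useful sanity check is the level-2 shuffle identity, S2 + S2ᵀ = S1 ⊗ S1, which holds for any path and is one reason signatures behave well as features: they are independent of the time parameterization of the trajectory.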

THE SIGNATURE OF A PATH
PROPOSED METHOD
ATTENTION NETWORK FOR ACTION RECOGNITION
RESULTS AND ANALYSIS
Findings
CONCLUSION
