Abstract

Large datasets are often used to improve the accuracy of action recognition. However, very large datasets are problematic because, among other issues, annotating them is labor-intensive. This has encouraged research in zero-shot action recognition (ZSAR). Presently, most ZSAR methods recognize actions from individual video frames. Such methods are sensitive to lighting, camera angle, and background, and most cannot model temporal information, which reduces accuracy. In this paper, to address these problems, we propose a three-stream graph convolutional network that processes both kinds of input. Our model has two parts: one part processes RGB data, which contains rich appearance information, and the other processes skeleton data, which is unaffected by lighting and background. By combining the two outputs with a weighted sum, the model produces the final ZSAR prediction. Experiments on three datasets show that our model achieves higher accuracy than a baseline model. Moreover, we show that the model can learn from human experience, which further improves its accuracy.
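The abstract does not give the fusion rule explicitly, but it describes a weighted sum of the two streams' outputs followed by matching against unseen class embeddings. The sketch below is only an illustration under that assumption; the function name `fuse_and_classify`, the weight `alpha`, and the embedding shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def fuse_and_classify(rgb_embedding, skeleton_embedding, class_embeddings, alpha=0.6):
    """Fuse two stream outputs with a weighted sum and pick the closest unseen class.

    rgb_embedding, skeleton_embedding: per-video vectors in the semantic space,
        one from the RGB stream and one from the skeleton stream (shape (d,)).
    class_embeddings: dict mapping unseen class names to word embeddings (shape (d,)).
    alpha: weight of the RGB stream; (1 - alpha) weights the skeleton stream.
    """
    fused = alpha * rgb_embedding + (1.0 - alpha) * skeleton_embedding
    fused = fused / (np.linalg.norm(fused) + 1e-12)

    best_label, best_sim = None, -np.inf
    for label, emb in class_embeddings.items():
        emb = emb / (np.linalg.norm(emb) + 1e-12)
        sim = float(fused @ emb)  # cosine similarity in the shared semantic space
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim
```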

Highlights

  • Human action recognition is currently a popular research field

  • To address the need for very large training sets and the associated annotation labor, researchers have combined action recognition with zero-shot learning (ZSL)

  • We combine action recognition with ZSL, a setting known as zero-shot action recognition (ZSAR), so that unseen actions can be recognized without large labeled datasets


Summary

Introduction

Human action recognition is currently a popular research field. It is a complex task, involving recognition of the action performed by a person, interactions between people, and interactions between people and the environment. To address the problems of excessive training data and annotation labor, researchers have attempted to combine action recognition with zero-shot learning (ZSL). We used a pre-trained spatial-temporal graph convolutional network (ST-GCN) [3] to extract motion features from the video and used the deep visual-semantic embedding model (DeViSE) [4] to predict unseen classes. This approach compensates for the shortcomings of the baseline model and improves accuracy.
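As a rough illustration of the DeViSE-style prediction step: a learned linear projection maps visual features (here, ST-GCN motion features) into the word-embedding space, and unseen classes are ranked by similarity to their label embeddings. The exact projection and dimensions are not given in this summary, so the layer sizes, names, and placeholder tensors below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeViSEHead(nn.Module):
    """Illustrative DeViSE-style projection head (hypothetical layer sizes)."""

    def __init__(self, feature_dim=256, embed_dim=300):
        super().__init__()
        # Linear map from the visual feature space into the word-embedding space.
        self.project = nn.Linear(feature_dim, embed_dim)

    def forward(self, features, label_embeddings):
        # features: (batch, feature_dim) visual features from the backbone
        # label_embeddings: (num_classes, embed_dim) word vectors of class names
        visual = F.normalize(self.project(features), dim=-1)
        labels = F.normalize(label_embeddings, dim=-1)
        return visual @ labels.t()  # (batch, num_classes) cosine-similarity scores

# Usage sketch: rank unseen classes for a batch of extracted features.
head = DeViSEHead()
features = torch.randn(4, 256)           # placeholder ST-GCN features
label_embeddings = torch.randn(10, 300)  # placeholder word embeddings of class names
scores = head(features, label_embeddings)
predictions = scores.argmax(dim=-1)      # index of the most similar unseen class
```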

Action Recognition
Zero-Shot Learning
Zero-Shot Action Recognition
Two-Stream Graph Convolutional Networks
Three-Stream Graph Convolutional Networks
Knowledge Graph
Extracting Features
Object Detection
Dataset
Baseline Comparison
Learning from Human Experience
Different Word Embeddings
Different Weights
Visualization of Semantic Space with Different Word Embeddings
Findings
Conclusions

