Abstract

In recent years, gesture recognition has been applied in many fields, such as games, robotics and sign language recognition. Progress in gesture recognition has significantly improved human-computer interaction (HCI), and gesture recognition in video is now an important research direction. Because each kind of neural network structure has its limitations, we propose a neural network that alternately fuses 3D CNN and ConvLSTM, which we call the Multiple extraction and Multiple prediction (MEMP) network. The main feature of the MEMP network is that it extracts and predicts the temporal and spatial feature information of a gesture video multiple times, which enables it to reach a high accuracy rate. In the experimental part, three data sets (LSA64, SKIG and Chalearn 2016) are used to verify the performance of the network. Our approach achieved high accuracy on all three data sets. On LSA64, the network achieved a recognition rate of 99.063%. On SKIG, it achieved recognition rates of 97.01% and 99.02% on the RGB part and the RGB-depth part, respectively. On Chalearn 2016, it achieved recognition rates of 74.57% and 78.85% on the RGB part and the RGB-depth part, respectively.

Highlights

  • Gesture communication is a widely used method in people’s daily lives

  • In the Multiple extraction and Multiple prediction (MEMP) neural network, 3D CNN is used to extract the spatial and temporal feature information of each frame, and ConvLSTM is used to predict the set of features

  • Methods based on deep learning are currently the main research focus in gesture recognition

Summary

Introduction

Gesture communication is a widely used method in people’s daily lives. Gesture interaction can be used in many kinds of scenes and has rich expressive power. A 2D CNN can make predictions on video by extracting spatial feature information from successive sets of frames in the video. ConvLSTM is a network structure that combines the convolution operation with LSTM. It not only extracts spatial features like a CNN but also models time series like an LSTM [10]. In this paper, we propose a multi-prediction neural network that repeatedly mixes 3D CNN operations and ConvLSTM operations. The MEMP network can improve the accuracy of gesture recognition. Experiments show that this network is suitable for medium and large data sets. Compared with the traditional combined neural network, the MEMP network retains more spatial-temporal feature information through multiple rounds of information extraction and prediction on the feature maps. We use three data sets to verify the characteristics of the MEMP network.
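The idea of alternating 3D CNN and ConvLSTM stages can be illustrated by tracing how the shape of a video tensor changes through such a stack. The sketch below is only an illustration under stated assumptions: the number of stages, the filter counts, the strides, the input size, the use of 'same' padding, and the choice that each ConvLSTM returns its output at every time step are all hypothetical and are not taken from the paper. The class names Conv3D and ConvLSTM are placeholders defined here, not calls into any deep learning library.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A video tensor shape: (frames, height, width, channels).
Shape = Tuple[int, int, int, int]


@dataclass
class Conv3D:
    """Shape model of a 3D convolution with 'same' padding (assumption)."""
    filters: int
    stride: Tuple[int, int, int] = (1, 1, 1)  # (time, height, width)

    def out_shape(self, s: Shape) -> Shape:
        t, h, w, _ = s
        st, sh, sw = self.stride
        # With 'same' padding, each output size is ceil(input / stride).
        return (-(-t // st), -(-h // sh), -(-w // sw), self.filters)


@dataclass
class ConvLSTM:
    """Shape model of a ConvLSTM that outputs every time step (assumption)."""
    filters: int

    def out_shape(self, s: Shape) -> Shape:
        t, h, w, _ = s
        # The recurrence keeps frames and spatial size; only channels change.
        return (t, h, w, self.filters)


def trace(input_shape: Shape, layers: list) -> List[Shape]:
    """Return the input shape followed by the shape after each layer."""
    shapes = [input_shape]
    for layer in layers:
        shapes.append(layer.out_shape(shapes[-1]))
    return shapes


# Hypothetical two-round stack: extract (Conv3D), predict (ConvLSTM), repeat.
layers = [
    Conv3D(32, stride=(1, 2, 2)),
    ConvLSTM(32),
    Conv3D(64, stride=(2, 2, 2)),
    ConvLSTM(64),
]
shapes = trace((16, 112, 112, 3), layers)

for layer, shape in zip(["input"] + [type(l).__name__ for l in layers], shapes):
    print(f"{layer:>8}: {shape}")
```

Because every stage in this sketch keeps a time axis, a later 3D convolution can still extract spatial-temporal features from the output of an earlier ConvLSTM, which is what makes repeated extraction and prediction possible in such a stack.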

Related Work
Proposed Method
Datasets
Video Processing
Implementation
Experimental Results
Method
Conclusions
