Abstract

Convolutional Neural Networks (CNNs) are effective at extracting image features, while the Long Short-Term Memory (LSTM) network is a natural choice for modeling time sequences. We combine these two methods into an end-to-end model for gesture recognition. In this study, we propose a neural network structure that uses a general CNN to extract frame-level spatial features and an LSTM to extract temporal features. We name this network the Temporal Convolution Neural Network (TCNN). Our experiments are performed on the VIVA Gesture Dataset, which contains 19 gestures performed by 8 subjects. Under 8-fold cross-validation, the proposed network structure outperforms state-of-the-art methods such as 3DCNN. We also compare results obtained with different backbone CNNs: the network based on ResNet50 achieves an accuracy of 82.3%, while the lighter and shallower MobileNet achieves 60%.
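The abstract describes a two-stage pipeline: a CNN backbone produces a feature vector per video frame, and an LSTM consumes those vectors in temporal order before a final classifier predicts one of 19 gesture classes. The sketch below illustrates that data flow only; it is not the paper's actual network. The CNN backbone is replaced by a random linear projection, all weights are random, and the tiny dimensions (`FEAT`, `HID`, the 4-"pixel" frames) are placeholders chosen for readability, not values from the paper.

```python
import math
import random

random.seed(0)
FEAT, HID, CLASSES, T = 8, 16, 19, 5  # toy sizes; 19 matches the gesture count

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Stand-in for a CNN backbone (e.g. ResNet50 or MobileNet in the paper):
# here just a linear projection of a flattened 4-"pixel" frame to FEAT dims.
W_cnn = rand_mat(FEAT, 4)

def cnn_features(frame):
    return matvec(W_cnn, frame)

# Minimal LSTM cell: input (i), forget (f), output (o), candidate (c) gates.
Wx = {g: rand_mat(HID, FEAT) for g in "ifoc"}
Wh = {g: rand_mat(HID, HID) for g in "ifoc"}

def lstm_step(x, h, c):
    i = [sigmoid(a + b) for a, b in zip(matvec(Wx["i"], x), matvec(Wh["i"], h))]
    f = [sigmoid(a + b) for a, b in zip(matvec(Wx["f"], x), matvec(Wh["f"], h))]
    o = [sigmoid(a + b) for a, b in zip(matvec(Wx["o"], x), matvec(Wh["o"], h))]
    g = [math.tanh(a + b) for a, b in zip(matvec(Wx["c"], x), matvec(Wh["c"], h))]
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

W_out = rand_mat(CLASSES, HID)

def classify(video):
    """video: list of T frames -> probability over the 19 gesture classes."""
    h, c = [0.0] * HID, [0.0] * HID
    for frame in video:                      # CNN per frame, then LSTM in time
        h, c = lstm_step(cnn_features(frame), h, c)
    logits = matvec(W_out, h)                # classify from the last hidden state
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]             # softmax

video = [[random.random() for _ in range(4)] for _ in range(T)]
probs = classify(video)
```

With untrained random weights the output is of course arbitrary; the point is only the shape of the computation: spatial features frame by frame, temporal aggregation by the LSTM, and a softmax over the gesture classes at the end.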
