Abstract

In this paper, we focus on large-scale isolated gesture recognition for RGB-D videos. We develop a novel ensemble method that explores deep spatio-temporal features using 3D Convolutional Neural Networks (CNNs) with a residual architecture (Res-C3D) and builds a time-series model from skeleton information using a Long Short-Term Memory (LSTM) network. First, the relative positions and angles of different keypoints are extracted and used to build the time-series model in the LSTM. Second, using the skeleton information (keypoints) of the body, we retain the arm regions and discard the other parts to obtain a masked Res-C3D, which reduces the effect of the background and other variations, since gestures are mainly derived from arm or hand movements. Moreover, because each voting sub-classifier may be advantageous for certain classes, the sub-classifier weights in our ensemble model are obtained adaptively by training rather than being fixed. Our experimental results show that the proposed method achieves state-of-the-art performance, with an accuracy of 0.6842 on the IsoGD dataset.
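As a rough illustration of the skeleton features mentioned above, the following minimal sketch computes per-frame relative positions and one joint angle from 2D keypoints. It is not the authors' implementation: the joint names (`neck`, `shoulder`, `elbow`, `wrist`), the choice of the neck as reference point, the single elbow angle, and all function names are assumptions made here for illustration only. The resulting per-frame vectors form the kind of sequence that could be fed to an LSTM.

```python
import math

def relative_position(joint, reference):
    """Offset (dx, dy) of a joint from a reference joint."""
    return (joint[0] - reference[0], joint[1] - reference[1])

def joint_angle(a, b, c):
    """Angle in radians at joint b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        return 0.0  # degenerate case: coincident keypoints
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos)))

def frame_features(keypoints):
    """Feature vector for one frame.

    `keypoints` maps hypothetical joint names to (x, y) coordinates.
    The neck is an assumed reference point, not taken from the paper.
    """
    neck = keypoints["neck"]
    features = []
    for name in ("shoulder", "elbow", "wrist"):
        features.extend(relative_position(keypoints[name], neck))
    features.append(
        joint_angle(keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"])
    )
    return features

def sequence_features(frames):
    """One feature vector per frame, in temporal order."""
    return [frame_features(kp) for kp in frames]
```

Calling `sequence_features` on a list of per-frame keypoint dictionaries returns one 7-dimensional vector per frame (three 2D offsets plus one angle) under the assumptions above.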
