Abstract

Recognizing human gestures in videos is one of the most difficult problems in computer vision because of irrelevant environmental variables. This problem has commonly been addressed with a single deep network that learns spatiotemporal features from video data, but this approach remains insufficient to capture both kinds of information, gesture shape and spatiotemporal variation, at the same time. As a result, researchers have fused multiple models to effectively collect important shape information as well as the precise spatiotemporal variation of gestures. In this study, we collected a dynamic dataset of twenty meaningful words of Arabic sign language (ArSL) using a Microsoft Kinect v2 camera; the recorded data comprise 7350 red, green, and blue (RGB) videos and 7350 depth videos. We propose four deep neural network models that use 2D and 3D convolutional neural networks (CNNs) to cover the feature extraction methods and then pass these features to a recurrent neural network (RNN) for sequence classification, where the RNN is either a long short-term memory (LSTM) or a gated recurrent unit (GRU) network. The study also evaluates fusion techniques for several types of multiple models. The experimental results show that the best multi-model achieves 100% accuracy on the dynamic ArSL recognition dataset.
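The pipeline the abstract describes is CNN feature extraction over video frames followed by an RNN for sequence classification. Below is a minimal PyTorch sketch of one such variant (per-frame 2D CNN into an LSTM); the layer sizes, 112x112 input resolution, 16-frame clip length, and class count are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """Per-frame 2D CNN features fed to an LSTM for gesture classification.
    Hypothetical sizes: 20 gesture classes, RGB frames resized to 112x112."""
    def __init__(self, num_classes: int = 20, hidden_size: int = 256):
        super().__init__()
        # Small 2D CNN applied independently to each video frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (N, 64, 1, 1)
        )
        # LSTM over the per-frame feature sequence (nn.GRU would give the GRU variant).
        self.rnn = nn.LSTM(input_size=64, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        feats = self.cnn(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.rnn(feats)      # (b, t, hidden)
        return self.fc(out[:, -1])    # classify from the last time step

# Usage: a batch of 2 clips, 16 frames each, 3x112x112 RGB.
logits = CNNLSTMClassifier()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 20])
```

A 3D CNN variant of this sketch would replace the per-frame `Conv2d` stack with `Conv3d` layers applied to the whole clip before the RNN, matching the 2D/3D split the abstract mentions.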
