Abstract

Continuous sign language recognition is a weakly supervised task dealing with the identification of continuous sign gestures from video sequences, without any prior knowledge of the temporal boundaries between consecutive signs. Most existing methods focus mainly on the extraction of spatio-temporal visual features, without exploiting text or contextual information to further improve recognition accuracy. Moreover, the ability of deep generative models to effectively model data distributions has not yet been investigated in the field of sign language recognition. To this end, a novel approach for context-aware continuous sign language recognition using a generative adversarial network architecture, named Sign Language Recognition Generative Adversarial Network (SLRGAN), is introduced. The proposed architecture consists of a generator, which recognizes sign language glosses by extracting spatial and temporal features from video sequences, and a discriminator, which evaluates the quality of the generator's predictions by modeling text information at the sentence and gloss levels. The paper also investigates the importance of contextual information in sign language conversations for both Deaf-to-Deaf and Deaf-to-hearing communication. Contextual information, in the form of hidden states extracted from the previous sentence, is fed into the bidirectional long short-term memory module of the generator to improve the recognition accuracy of the network. At the final stage, sign language translation is performed by a transformer network, which converts sign language glosses to natural language text. Our proposed method achieved word error rates of 23.4%, 2.1%, and 2.26% on the RWTH-Phoenix-Weather-2014, Chinese Sign Language (CSL), and Greek Sign Language (GSL) Signer Independent (SI) datasets, respectively.
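Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of the adversarial setup it outlines: a generator that maps video frames to gloss scores through a CNN and a bidirectional LSTM (optionally initialized with hidden states carried over from the previous sentence), and a discriminator that scores gloss sequences. The ResNet-18 backbone, hidden sizes, and GRU-based discriminator are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class Generator(nn.Module):
    """Maps a video clip to per-frame gloss scores."""

    def __init__(self, num_glosses: int, hidden: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=None)  # spatial features (assumed backbone)
        backbone.fc = nn.Identity()               # keep the 512-d pooled features
        self.cnn = backbone
        self.blstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_glosses)

    def forward(self, frames, context=None):
        # frames: (B, T, 3, H, W); context: optional (h0, c0) LSTM states carried
        # over from the previous sentence -- the paper's contextual cue.
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.blstm(feats, context)
        return self.head(out)                     # (B, T, num_glosses)


class Discriminator(nn.Module):
    """Scores how plausible a gloss sequence is at the sentence level."""

    def __init__(self, num_glosses: int, emb: int = 256):
        super().__init__()
        self.embed = nn.Linear(num_glosses, emb)  # accepts soft gloss distributions
        self.gru = nn.GRU(emb, emb, batch_first=True)
        self.score = nn.Linear(emb, 1)

    def forward(self, gloss_probs):
        # gloss_probs: (B, T, num_glosses); one-hot for ground-truth glosses,
        # softmax outputs for the generator's predictions.
        _, h = self.gru(self.embed(gloss_probs))
        return torch.sigmoid(self.score(h[-1]))   # (B, 1) realism score
```

The final gloss-to-text stage could similarly be prototyped with PyTorch's generic nn.Transformer, though the paper's exact translator configuration is not given here.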

Highlights

  • Sign language (SL) is the primary means of communication for hearing-impaired people in their everyday life, and it consists of a well-structured set of grammar rules and vocabulary, similar to spoken languages

  • The Continuous Sign Language Recognition (CSLR) task focuses on recognizing sequences of glosses from videos without predefined annotation boundaries, and it is more challenging than Isolated Sign Language Recognition (ISLR) [9], in which the temporal boundaries of glosses in the videos are predefined

  • In the last set of experiments, the predictions of the Sign Language Recognition Generative Adversarial Network (SLRGAN) were fed into the transformer for Sign Language Translation (SLT)

Introduction

Sign language (SL) is the primary means of communication for hearing-impaired people in their everyday life, and it consists of a well-structured set of grammar rules and vocabulary, similar to spoken languages. Although Sign Language Recognition (SLR), i.e., the automated recognition of glosses and the translation of SL into spoken language, is of great importance for the communication of the Deaf community with hearing people (Deaf-to-hearing communication and vice versa) or among different Deaf communities (Deaf-to-Deaf communication), it is still considered a challenging research area. This is mainly because sign languages feature thousands of signs, sometimes differing only by subtle changes in hand motion, shape, or position, and involving significant finger overlaps and occlusions [2]. The Continuous Sign Language Recognition (CSLR) task focuses on recognizing sequences of glosses from videos without predefined annotation boundaries, and it is more challenging than Isolated Sign Language Recognition (ISLR) [9], in which the temporal boundaries of glosses in the videos are predefined.
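Because no frame-level boundaries are available, CSLR systems are typically trained with a sequence-level loss such as Connectionist Temporal Classification (CTC), which marginalizes over all monotonic frame-to-gloss alignments. The sketch below illustrates this weakly supervised setup in PyTorch; whether SLRGAN uses CTC in exactly this form is an assumption, and all shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 100, 1200                             # batch size, video frames, gloss vocabulary
logits = torch.randn(T, B, V, requires_grad=True)  # per-frame gloss scores from the recognizer
log_probs = F.log_softmax(logits, dim=-1)

# Each target is just the ordered gloss sequence -- no frame boundaries given.
targets = torch.randint(1, V, (B, 8))   # 8 glosses per sentence; index 0 is the CTC blank
input_lengths = torch.full((B,), T)
target_lengths = torch.full((B,), 8)

# CTC sums over every valid alignment between the T frames and the 8 glosses,
# so the model learns the segmentation implicitly from sentence-level labels.
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
loss.backward()
```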
