Abstract

Sign language is used across the world for communication within hearing-impaired communities. Most hearing people are not versed in sign language, and many hearing-impaired people are not proficient in written text, creating a communication barrier. Research on Sign Language Recognition (SLR) systems has shown promising solutions to this issue. In Sri Lanka, machine learning together with neural networks has been the prominent research direction for Sinhala SLR. Previous work has focused mainly on word-level SLR using hand gestures for translation. While this works for a limited vocabulary, many signs are interpreted through other spatial cues such as lip movements and facial expressions; translation is therefore limited and interpretations can be misleading. In this research, we propose a multi-modal deep learning approach that effectively recognizes sentence-level sign gestures using hand and lip movements and translates them to Sinhala text. The model consists of modules for visual feature extraction (ResNet), contextual relationship modeling (transformer encoder with multi-head attention), alignment (Connectionist Temporal Classification, CTC) and decoding (prefix beam search). A dataset of 22 sentences, collected under controlled conditions for a specific day-to-day scenario (a conversation between a vendor and a customer in a shop), was used for evaluation. The proposed model achieves a best Word Error Rate (WER) of 12.70 on the testing split, improving over the single-stream model's best WER of 17.41 and suggesting that a multi-modal approach improves overall SLR.
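To make the described pipeline concrete, the sketch below outlines one plausible arrangement of the components named in the abstract (two ResNet feature streams for hands and lips, a transformer encoder, CTC alignment, and beam-search decoding at inference). This is a minimal illustration in PyTorch, not the authors' released code; the backbone choice (ResNet-18), feature dimensions, layer counts, and class names are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiModalSLR(nn.Module):
    """Hypothetical two-stream (hand + lip) sentence-level SLR model."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        # Two ResNet backbones extract per-frame visual features:
        # one for the hand-gesture stream, one for the lip stream.
        self.hand_cnn = models.resnet18(weights=None)
        self.hand_cnn.fc = nn.Linear(self.hand_cnn.fc.in_features, d_model)
        self.lip_cnn = models.resnet18(weights=None)
        self.lip_cnn.fc = nn.Linear(self.lip_cnn.fc.in_features, d_model)

        # Fuse the two streams, then model contextual relationships
        # with a transformer encoder (multi-head self-attention).
        self.fuse = nn.Linear(2 * d_model, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)

        # Frame-wise class scores for CTC (blank token at index 0).
        self.classifier = nn.Linear(d_model, vocab_size + 1)

    def forward(self, hand_frames, lip_frames):
        # hand_frames, lip_frames: (batch, time, 3, H, W)
        b, t = hand_frames.shape[:2]
        h = self.hand_cnn(hand_frames.flatten(0, 1)).view(b, t, -1)
        l = self.lip_cnn(lip_frames.flatten(0, 1)).view(b, t, -1)
        x = self.fuse(torch.cat([h, l], dim=-1))
        x = self.encoder(x)
        # Log-probabilities shaped (time, batch, classes), as nn.CTCLoss expects.
        return self.classifier(x).log_softmax(-1).transpose(0, 1)

# Training aligns frame-level predictions with the target word sequence via CTC;
# at inference, prefix beam search would decode the emitted log-probabilities.
ctc_loss = nn.CTCLoss(blank=0)
```

In this arrangement, CTC removes the need for frame-level annotations: the loss marginalizes over all alignments between the frame-wise predictions and the target Sinhala word sequence, and prefix beam search recovers the most likely sentence at decoding time.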
