Abstract

Recent advances in machine translation, driven by attention mechanisms and Transformer networks, have accelerated research in Sign Language Translation (SLT), a spatio-temporal vision translation task. Fundamentally, Transformers are unaware of the sequential ordering of their input, so position information must be fed to them explicitly; their sequence-learning capability depends heavily on this ordering information. Unlike existing Transformer models for SLT, which use the baseline architecture with sinusoidal position embeddings, this work incorporates a new positioning scheme into the Transformer network for SLT. This is the first work in SLT to explore the positioning scheme of Transformers for optimizing translation scores. The study proposes the Gated Recurrent Unit (GRU)-Relative Sign Transformer (RST) for jointly learning Continuous Sign Language Recognition (CSLR) and translation, which significantly improves video translation quality. In this approach, a GRU acts as the relative position encoder, and RST is the Transformer model with relative position incorporated into the Multi-Head Attention (MHA). Evaluation was performed on the RWTH-PHOENIX-2014T benchmark dataset. The study reports a state-of-the-art Bilingual Evaluation Understudy (BLEU-4) score of 22.4 and a Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score of 48.55 for SLT with GRU-RST. The best Word Error Rate (WER) obtained with this approach is 23.5. A detailed study of the position encoding schemes of Transformers is presented, and translation performance is analyzed under various combinations of these schemes.
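To make the core idea concrete, the sketch below illustrates one plausible reading of "GRU as relative position encoder with relative position in MHA": a Shaw-style relative-position term is added to the attention logits, with the per-distance representations produced by a GRU rather than a plain embedding table. This is an illustrative simplification, not the authors' implementation; the class name, the clipping window `max_rel_dist`, and the single-head formulation are all assumptions.

```python
import torch
import torch.nn as nn


class GRURelativeAttention(nn.Module):
    """Single-head attention with GRU-derived relative position encodings
    added to the attention logits (hypothetical sketch, names illustrative)."""

    def __init__(self, d_model: int, max_rel_dist: int = 16):
        super().__init__()
        self.d_model = d_model
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One learnable vector per clipped relative distance; the GRU then
        # encodes the ordered sequence of distances so neighbouring offsets
        # share sequential structure.
        self.rel_embed = nn.Embedding(2 * max_rel_dist + 1, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Pairwise relative distances, clipped to [-max_rel_dist, max_rel_dist].
        pos = torch.arange(t, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        rel_idx = rel + self.max_rel_dist                      # (t, t)

        # Encode all possible clipped distances once with the GRU.
        dist_seq = self.rel_embed(
            torch.arange(2 * self.max_rel_dist + 1, device=x.device))
        rel_enc, _ = self.gru(dist_seq.unsqueeze(0))           # (1, 2R+1, d)
        rel_k = rel_enc.squeeze(0)[rel_idx]                    # (t, t, d)

        # Content term plus relative-position term in the logits.
        scores = torch.einsum('bqd,bkd->bqk', q, k)
        scores = scores + torch.einsum('bqd,qkd->bqk', q, rel_k)
        attn = torch.softmax(scores / self.d_model ** 0.5, dim=-1)
        return attn @ v                                        # (b, t, d)
```

In a full GRU-RST model, this mechanism would replace the standard (absolute, sinusoidal) positioning inside each MHA layer; here it is reduced to one head for readability.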
