Abstract

Sign language translation is a complex task that involves generating spoken-language sentences from sign language (SL) videos while considering the signer's manual and nonmanual movements. We observed the following issues with existing SL translation (SLT) methods and datasets that limit performance. First, not every SL video frame has a gloss annotation. Second, nonmanual components can easily be overlooked despite their importance because they occupy only small regions of the image. Third, recent transformer-based SLT models have numerous parameters and struggle to capture the local context of SL images comprehensively. To address these problems, we propose an action tokenizer that divides SL videos into semantic units. In addition, we design a keypoint emphasizer and a convolutional-embedded SL transformer (CSLT) to capture salient manual and subtle nonmanual features effectively. By applying the proposed modules to Sign2(Gloss+Text), we introduce CSLT with an action tokenizer and keypoint emphasizer (CSLT-AK), a simple yet efficient and effective SLT model grounded in domain knowledge. Experimental results on RWTH-PHOENIX-Weather 2014T show that CSLT-AK surpasses the baseline in both translation performance and parameter reduction, and achieves competitive performance compared to other state-of-the-art models without requiring additional regularization methods.
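To illustrate the general idea behind a convolutional-embedded transformer, the following is a minimal PyTorch-style sketch of a 1-D convolutional embedding placed before a transformer encoder, so that local temporal context among neighboring frames is captured before global self-attention. This is not the authors' CSLT-AK implementation; the class name, feature dimensions, kernel size, and layer counts are illustrative assumptions.

```python
# Hypothetical sketch: convolutional embedding feeding a transformer encoder.
# All names and hyperparameters are illustrative, not the paper's configuration.
import torch
import torch.nn as nn


class ConvEmbeddedEncoder(nn.Module):
    def __init__(self, in_dim=1024, d_model=256, n_heads=4, n_layers=2, kernel=3):
        super().__init__()
        # 1-D convolution over the temporal axis models local context
        # among neighboring frames before global self-attention is applied.
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size=kernel, padding=kernel // 2)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, time, in_dim) frame-level features extracted from an SL video
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local (convolutional) embedding
        return self.encoder(x)                            # global (attention-based) context


if __name__ == "__main__":
    frames = torch.randn(2, 120, 1024)   # e.g., 120 frames of visual features
    out = ConvEmbeddedEncoder()(frames)
    print(out.shape)                     # torch.Size([2, 120, 256])
```

The design choice illustrated here is the common pairing of convolutions for local patterns with self-attention for long-range dependencies; the paper's actual modules (action tokenizer, keypoint emphasizer) are not reproduced in this sketch.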
