Abstract
Sign language translation is a complex task that involves generating spoken-language sentences from sign language (SL) videos while accounting for the signer's manual and nonmanual movements. We identified the following issues in existing SL translation (SLT) methods and datasets that limit performance. First, not every SL video frame has a gloss annotation. Second, nonmanual components, despite their importance, are easily overlooked because they occupy small regions of the image. Third, recent transformer-based SLT models have numerous parameters yet struggle to fully capture the local context of SL images. To address these problems, we propose an action tokenizer that divides SL videos into semantic units. In addition, we design a keypoint emphasizer and a convolutional-embedded SL transformer (CSLT) to effectively capture prominent manual and subtle nonmanual features. By applying the proposed modules to Sign2(Gloss+Text), we introduce CSLT with an action tokenizer and keypoint emphasizer (CSLT-AK), a simple yet efficient and effective SLT model grounded in domain knowledge. Experimental results on the RWTH-PHOENIX-Weather 2014T dataset reveal that CSLT-AK surpasses the baseline in translation performance while using fewer parameters, and achieves competitive performance against other state-of-the-art models without requiring regularization methods.
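To make the three components named in the abstract concrete, the following is a minimal sketch of how such a pipeline could be wired together: a motion-based action tokenizer that segments a keypoint sequence into units, a keypoint emphasizer that re-weights frame features using keypoint coordinates, and a transformer encoder whose token embedding is a temporal convolution for local context. All module designs, names, thresholds, and dimensions here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only: the real CSLT-AK architecture is defined in the paper,
# not reproduced here. Every design choice below is an assumption.
import torch
import torch.nn as nn


def action_tokenize(keypoints: torch.Tensor, threshold: float = 0.05):
    """Split a keypoint sequence (T, K, 2) into segments at low-motion frames.

    Assumed heuristic: frames where mean keypoint displacement falls below
    `threshold` are treated as boundaries between semantic units.
    """
    motion = (keypoints[1:] - keypoints[:-1]).norm(dim=-1).mean(dim=-1)  # (T-1,)
    boundaries = (motion < threshold).nonzero(as_tuple=True)[0] + 1
    starts = [0] + boundaries.tolist()
    ends = boundaries.tolist() + [keypoints.shape[0]]
    return [(s, e) for s, e in zip(starts, ends) if e > s]


class KeypointEmphasizer(nn.Module):
    """Scale frame features with a gate conditioned on keypoint coordinates,
    so channels driven by hand/face keypoints are amplified."""

    def __init__(self, num_keypoints: int, feat_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(num_keypoints * 2, feat_dim), nn.Sigmoid())

    def forward(self, frame_feats, keypoints):
        # frame_feats: (T, D), keypoints: (T, K, 2)
        g = self.gate(keypoints.flatten(1))  # (T, D), values in [0, 1]
        return frame_feats * (1.0 + g)


class ConvEmbeddedEncoder(nn.Module):
    """Transformer encoder whose input embedding is a temporal convolution,
    letting each token mix local context before self-attention."""

    def __init__(self, feat_dim: int = 512, d_model: int = 256, n_layers: int = 2):
        super().__init__()
        self.conv_embed = nn.Conv1d(feat_dim, d_model, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats):
        # frame_feats: (T, D) for a single video
        x = self.conv_embed(frame_feats.t().unsqueeze(0))  # (1, d_model, T)
        return self.encoder(x.transpose(1, 2))             # (1, T, d_model)


if __name__ == "__main__":
    T, K, D = 64, 27, 512
    feats, kps = torch.randn(T, D), torch.rand(T, K, 2)
    segments = action_tokenize(kps)
    emphasized = KeypointEmphasizer(K, D)(feats, kps)
    encoded = ConvEmbeddedEncoder(D)(emphasized)
    print(len(segments), encoded.shape)
```

In this sketch the emphasized, convolution-embedded frame representations would feed a standard encoder-decoder translation head; the segment boundaries from the tokenizer would group frames into the semantic units the abstract describes.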