Abstract
Recently, the application of transformers has made significant progress in sign language translation. However, existing transformer-based methods neglect several characteristics of sign videos, which hinders translation performance. Firstly, in sign videos, multiple consecutive frames represent a single sign gloss, so local temporal relations are crucial. Secondly, the misalignment between video and text demands that the model capture non-local and global context. To address these issues, a locality-aware transformer is proposed for sign language translation. Concretely, a multi-stride position encoding scheme assigns the same position index to adjacent frames at various strides to enhance local dependency. Afterward, an adaptive temporal interaction module captures non-local and flexible local frame correlations simultaneously. Moreover, a gloss counting task is designed to facilitate holistic understanding of sign videos. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed framework.
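To make the multi-stride position encoding idea concrete, the sketch below shows one possible reading of it: for each stride s, every window of s consecutive frames shares a single position index (frame t maps to index t // s), so neighbouring frames of the same sign receive identical encodings at coarser strides. The abstract does not specify how the strides are combined; the stride set (1, 2, 4), the averaging across strides, and the function names here are assumptions for illustration only, not the authors' implementation.

```python
import torch


def sinusoidal_embedding(position_ids: torch.Tensor, d_model: int) -> torch.Tensor:
    """Standard sinusoidal embeddings evaluated at (possibly repeated) position ids."""
    positions = position_ids.float().unsqueeze(-1)                       # (T, 1)
    dims = torch.arange(d_model, dtype=torch.float)                      # (D,)
    angle_rates = 1.0 / torch.pow(10000.0, (2 * (dims // 2)) / d_model)  # (D,)
    angles = positions * angle_rates                                     # (T, D)
    emb = torch.zeros_like(angles)
    emb[:, 0::2] = torch.sin(angles[:, 0::2])
    emb[:, 1::2] = torch.cos(angles[:, 1::2])
    return emb


def multi_stride_position_encoding(num_frames: int, d_model: int,
                                   strides=(1, 2, 4)) -> torch.Tensor:
    """Combine sinusoidal encodings built from several strides (combination rule assumed).

    For stride s, frame t gets position index t // s, so adjacent frames within the
    same stride-s window share one index and hence one encoding.
    """
    encodings = []
    for s in strides:
        position_ids = torch.arange(num_frames) // s  # e.g. stride 2 -> 0,0,1,1,2,2,...
        encodings.append(sinusoidal_embedding(position_ids, d_model))
    return torch.stack(encodings, dim=0).mean(dim=0)  # (num_frames, d_model)


# Example: an 8-frame clip with a 16-dim model; adjacent frames share indices
# at strides 2 and 4, emphasising local temporal structure.
pe = multi_stride_position_encoding(num_frames=8, d_model=16)
print(pe.shape)  # torch.Size([8, 16])
```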