Abstract

Word-level sign language recognition (WSLR), which aims to translate a sign video into a single word, is a fundamental task in visual sign language research. Existing WSLR methods focus on recognizing frontal-view hand images, which may hurt performance due to hand occlusion. Non-frontal-view hand images, however, contain complementary and beneficial information that can be used to enhance the frontal view. Based on this observation, this paper presents an end-to-end Multi-View Knowledge Transfer (MVKT) network, which, to our knowledge, is the first SLR work to learn visual features from multiple views simultaneously. The model consists of three components: 1) a 3D-ResNet backbone that extracts view-common and view-specific representations; 2) a Knowledge Transfer module that interchanges complementary information between views; and 3) a View Fusion module that aggregates discriminative representations to obtain global clues. In addition, we construct a Multi-View Sign Language (MVSL) dataset, which contains 10,500 sign language videos synchronously collected from multiple views, with clear annotations and high quality. Extensive experiments on the MVSL dataset show that the MVKT model trained with multiple views achieves significant improvement when tested with either multiple or single views, making it feasible and effective in real-world applications.
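The abstract only names the three components, not their internals, so the sketch below is purely illustrative: a minimal PyTorch-style wiring of a per-view backbone, a cross-view transfer step, and a view-fusion classifier. The class name MVKTSketch, the tiny 3D-conv backbone standing in for 3D-ResNet, the attention-based transfer, the concatenation-based fusion, and all dimensions are assumptions, not details from the paper.

```python
# Hypothetical sketch of a three-stage multi-view pipeline (backbone ->
# knowledge transfer -> view fusion). All architectural choices here are
# assumptions made for illustration, not the authors' implementation.
import torch
import torch.nn as nn

class MVKTSketch(nn.Module):
    def __init__(self, num_views: int = 3, feat_dim: int = 512, num_classes: int = 500):
        super().__init__()
        # 1) Shared backbone extracting per-view spatio-temporal features
        #    (a tiny 3D conv stack standing in for the 3D-ResNet backbone).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # 2) Knowledge Transfer stand-in: cross-view attention so each view can
        #    borrow complementary information from the other views (assumed).
        self.cross_view_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # 3) View Fusion stand-in: aggregate the refined per-view features
        #    into one global representation, then classify into a word label.
        self.fusion = nn.Linear(num_views * feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, views, channels, time, height, width)
        b, v = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, v, -1)  # (b, v, feat_dim)
        refined, _ = self.cross_view_attn(feats, feats, feats)     # exchange info across views
        fused = self.fusion(refined.flatten(1))                    # (b, feat_dim)
        return self.classifier(fused)                              # word-level logits

# Usage: three synchronized camera views of a 16-frame clip at 112x112.
model = MVKTSketch()
logits = model(torch.randn(2, 3, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 500])
```

Because the backbone weights are shared across views, such a model can still be run at test time on a single view by passing only that view, which is one way to read the abstract's claim about single-view evaluation.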
