Abstract

Word-level sign language recognition (WSLR), which aims to translate a sign video into a single word, is a fundamental task in visual sign language research. Existing WSLR methods focus on recognizing frontal-view hand images, which may hurt performance due to hand occlusion. Non-frontal-view hand images, however, contain complementary and beneficial information that can be used to enhance the frontal view. Based on this observation, this paper presents an end-to-end Multi-View Knowledge Transfer (MVKT) network, which, to our knowledge, is the first SLR work to learn visual features from multiple views simultaneously. The model consists of three components: 1) a 3D-ResNet backbone that extracts view-common and view-specific representations; 2) a Knowledge Transfer module that interchanges complementary information between views; and 3) a View Fusion module that aggregates discriminative representations to obtain global clues. In addition, we construct a Multi-View Sign Language (MVSL) dataset, which contains 10,500 sign language videos synchronously collected from multiple views, with clear annotations and high quality. Extensive experiments on the MVSL dataset show that the MVKT model trained with multiple views achieves significant improvement when tested with either multiple or single views, making it feasible and effective in real-world applications.
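The abstract only names the three components, not their internals, so the sketch below is purely illustrative: a minimal PyTorch-style wiring of a per-view backbone, a cross-view transfer step, and a view-fusion classifier. The class name MVKTSketch, the tiny 3D-conv backbone standing in for 3D-ResNet, the attention-based transfer, the concatenation-based fusion, and all dimensions are assumptions, not details from the paper.

```python
# Hypothetical sketch of a three-stage multi-view pipeline (backbone ->
# knowledge transfer -> view fusion). All architectural choices here are
# assumptions made for illustration, not the authors' implementation.
import torch
import torch.nn as nn

class MVKTSketch(nn.Module):
    def __init__(self, num_views: int = 3, feat_dim: int = 512, num_classes: int = 500):
        super().__init__()
        # 1) Shared backbone extracting per-view spatio-temporal features
        #    (a tiny 3D conv stack standing in for the 3D-ResNet backbone).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # 2) Knowledge Transfer stand-in: cross-view attention so each view can
        #    borrow complementary information from the other views (assumed).
        self.cross_view_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # 3) View Fusion stand-in: aggregate the refined per-view features
        #    into one global representation, then classify into a word label.
        self.fusion = nn.Linear(num_views * feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, views, channels, time, height, width)
        b, v = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, v, -1)  # (b, v, feat_dim)
        refined, _ = self.cross_view_attn(feats, feats, feats)     # exchange info across views
        fused = self.fusion(refined.flatten(1))                    # (b, feat_dim)
        return self.classifier(fused)                              # word-level logits

# Usage: three synchronized camera views of a 16-frame clip at 112x112.
model = MVKTSketch()
logits = model(torch.randn(2, 3, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 500])
```

Because the backbone weights are shared across views, such a model can still be run at test time on a single view by passing only that view, which is one way to read the abstract's claim about single-view evaluation.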
