Abstract

The aim of Continuous Sign Language Recognition (CSLR) is to recognize unsegmented signs from image streams and convert them into gloss sequences. Because data collection and annotation for sign language are costly, most recent CSLR works address this problem in a weakly supervised manner and adopt network architectures consisting of a feature extractor and an alignment module. In such designs, the gloss-level semantic features are used to align the gesture-level representational features, which may suppress the activation of non-key frames. To further enhance the generalization ability of the visual extractor, this paper proposes a Visual-Lexical Alignment Constraint (VLAC) with improved self-distillation-based alignment supervision. Specifically, the proposed model adopts a visual-lexical module to supervise the visual and lexical modules separately. In addition, we replace the traditional Bi-LSTM lexical extractor with a lite-transformer encoder, which is beneficial in terms of parallelization and computational efficiency. Experimental results on the RWTH-PHOENIX-Weather-2014 dataset show that the proposed model outperforms current state-of-the-art models.
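As a rough illustration of the components named above, the sketch below pairs a small Transformer-encoder lexical module with a KL-divergence self-distillation loss between the visual and lexical classifiers, in the spirit of VAC-style alignment supervision. The layer sizes, temperature, class names, and the exact loss form are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LexicalEncoder(nn.Module):
    """Stand-in for the lite-transformer lexical module: a small Transformer
    encoder over frame-level visual features, followed by a gloss classifier.
    All hyperparameters here are illustrative, not from the paper."""
    def __init__(self, dim=512, heads=8, layers=2, num_glosses=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # +1 output class for the CTC blank token
        self.classifier = nn.Linear(dim, num_glosses + 1)

    def forward(self, visual_feats):
        # visual_feats: (T, B, dim) frame-level features from the visual extractor
        return self.classifier(self.encoder(visual_feats))  # (T, B, num_glosses + 1)

def distillation_loss(visual_logits, lexical_logits, temperature=8.0):
    """Self-distillation alignment supervision: the lexical module's per-frame
    gloss posterior serves as a soft teacher for the visual extractor's
    classifier. A common VAC-style formulation; the paper's loss may differ."""
    t = temperature
    student = F.log_softmax(visual_logits / t, dim=-1)
    teacher = F.softmax(lexical_logits.detach() / t, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * (t * t)
```

In this sketch, both the visual classifier and the lexical classifier would also be trained with CTC losses against the gloss sequence, with the distillation term added as an auxiliary constraint between the two branches.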
