Abstract

Sign Language Recognition (SLR) aims to interpret sign language videos into natural language, which greatly facilitates communication between the deaf and the general public. SLR is usually formulated as a sequence alignment problem, in which connectionist temporal classification (CTC) plays an important role in building effective alignments between video sequences and sentence-level labels. However, CTC-based SLR methods fail when the output label sequence is longer than the input video sequence, and they ignore the interdependencies among output predictions. This paper addresses these two issues and proposes a new RNN-Transducer-based SLR framework, the visual hierarchy to lexical sequence alignment network (H2SNet). In this framework, we design a visual hierarchy transcription network to capture the spatial appearance and temporal motion cues of sign videos at multiple levels, and a lexical prediction network to extract effective contextual information from previous output predictions. An RNN-Transducer is applied to learn the mapping between the sequential video features and the sentence-level labels. Extensive experiments validate the effectiveness and superiority of our approach over state-of-the-art methods.
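
The abstract does not include code, but the alignment mechanism it builds on can be illustrated with a minimal RNN-Transducer sketch in PyTorch. The module names (PredictionNetwork, JointNetwork), dimensions, the placeholder visual features, and the use of torchaudio's rnnt_loss are illustrative assumptions, not the authors' H2SNet implementation; the hierarchical visual transcription network is abstracted away as a tensor of frame-level features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchaudio.functional import rnnt_loss  # assumes a recent torchaudio

class PredictionNetwork(nn.Module):
    """Autoregressive LSTM over previously emitted tokens (the lexical context)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, embed_dim)   # +1 for the blank/start symbol
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, U) label history, prepended with the blank/start symbol
        out, _ = self.rnn(self.embed(tokens))
        return out                                              # (B, U, hidden_dim)

class JointNetwork(nn.Module):
    """Scores every (video frame, label history) pair over the vocabulary plus blank."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)         # +1 for blank

    def forward(self, enc, pred):
        # enc:  (B, T, enc_dim)  frame-level visual features
        # pred: (B, U, pred_dim) lexical features from the prediction network
        T, U = enc.size(1), pred.size(1)
        enc = enc.unsqueeze(2).expand(-1, -1, U, -1)            # (B, T, U, enc_dim)
        pred = pred.unsqueeze(1).expand(-1, T, -1, -1)          # (B, T, U, pred_dim)
        joint = torch.tanh(self.proj(torch.cat([enc, pred], dim=-1)))
        return self.out(joint)                                  # (B, T, U, vocab+1) logits

# Training-step sketch with placeholder data; enc_feats would come from the visual encoder.
B, T, U, V = 2, 40, 12, 1000
enc_feats = torch.randn(B, T, 512)                              # stand-in visual features
labels = torch.randint(1, V, (B, U), dtype=torch.int32)         # gloss/word label sequence
history = F.pad(labels, (1, 0), value=0).long()                 # prepend blank (id 0) as start

pred_net = PredictionNetwork(V)
joint_net = JointNetwork(enc_dim=512, pred_dim=512, joint_dim=512, vocab_size=V)

logits = joint_net(enc_feats, pred_net(history))                # (B, T, U+1, V+1)
loss = rnnt_loss(logits, labels,
                 logit_lengths=torch.full((B,), T, dtype=torch.int32),
                 target_lengths=torch.full((B,), U, dtype=torch.int32),
                 blank=0)
```

The contrast with CTC is visible in the (B, T, U+1, V+1) joint lattice: the transducer may emit several labels per video frame, so the label sequence can be longer than the input sequence, and each emission is conditioned on the label history encoded by the prediction network rather than being predicted independently.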
