Abstract

Visual feature extraction is key to continuous sign language recognition (CSLR). However, current CSLR methods based on single-branch networks rely only on intra-video correlation learning to optimize visual features, which limits the robustness of the resulting visual representations. Hence, we propose a novel CSLR method based on intra- and inter-video correlation learning with a two-branch network, named TB-Net. TB-Net explicitly establishes intra-video correlation between glosses and the most relevant video clips within each branch, and then introduces inter-video correlation at the branch confluence to enhance visual feature extraction. Specifically, we introduce a contrastive learning-based inter-video correlation module (IEM), which co-optimizes the visual features from both branches through inter- and intra-video losses to improve their generalization. In addition, we propose an intra-video correlation module (IAM) based on a gloss-guided attention feature generator that adaptively builds mappings between glosses and video clips, thereby yielding preliminary gloss-guided visual features within a single video. Extensive experiments on four public CSLR benchmarks show the superior performance of our method.
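The sketch below is a minimal illustration, not the authors' implementation, of the two ideas named in the abstract: a gloss-guided attention step (IAM-style intra-video correlation) and a contrastive loss between the two branches (IEM-style inter-video correlation). It assumes PyTorch; the class and function names, the InfoNCE-style loss, and all dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlossGuidedAttention(nn.Module):
    """IAM-style intra-video correlation: gloss embeddings attend to clip features."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, gloss_emb, clip_feats):
        # gloss_emb:  (B, L, D) gloss-sequence embeddings used as queries
        # clip_feats: (B, T, D) per-clip visual features used as keys/values
        out, _ = self.attn(gloss_emb, clip_feats, clip_feats)
        return out  # gloss-guided visual features, one vector per gloss


def inter_video_contrastive(feats_a, feats_b, temperature=0.07):
    """IEM-style inter-video correlation: InfoNCE between the two branches.

    Matching gloss positions across branches act as positives; all other
    positions in the batch act as negatives.
    """
    a = F.normalize(feats_a.flatten(0, 1), dim=-1)  # (B*L, D)
    b = F.normalize(feats_b.flatten(0, 1), dim=-1)  # (B*L, D)
    logits = a @ b.t() / temperature                # (B*L, B*L)
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage: random tensors stand in for two video branches of the same gloss sequence.
B, T, L, D = 2, 16, 5, 256
iam = GlossGuidedAttention(D)
gloss = torch.randn(B, L, D)
feats_branch1 = iam(gloss, torch.randn(B, T, D))
feats_branch2 = iam(gloss, torch.randn(B, T, D))
loss_inter = inter_video_contrastive(feats_branch1, feats_branch2)
```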
