In sign language communication, the combination of hand signs and facial expressions is used to convey messages in a fluid manner. Accurate interpretation relies heavily on understanding the context of these signs. Current methods, however, often focus on static images, missing the continuous flow and the story that unfolds through successive movements in sign language. To address this constraint, our research introduces the Dynamic Feature Contrast Net Plus (DFCNet+), a novel model that incorporates both dynamic feature extraction and cross-modal learning. The dynamic feature extraction module of DFCNet+ uses dynamic trajectory capture to monitor and record motion across frames and apply key features as an enhancement tool that highlights pixels that are critical for recognizing important sign language movements, allowing the model to follow the temporal variation of the signs. In the cross-modal learning module, we depart from the conventional approach of aligning video frames with textual descriptions. Instead, we adopt a gloss-level alignment, which provides a more detailed match between the visual signals and their corresponding text glosses, capturing the intricate relationship between what is seen and the associated text. The enhanced proficiency of DFCNet+ in discerning inter-frame details translates to heightened precision on benchmarks such as PHOENIX14, PHOENIX14-T and CSL-Daily. Such performance underscores its advantage in dynamic feature capture and inter-modal learning compared to conventional approaches to sign language interpretation. Our code is available at https://github.com/fyzjut/DFCNet_Plus.
Read full abstract