Abstract

Sign language recognition (SLR) is typically performed on long sequences of continuous video frames, which makes it difficult to use the information in such a large volume of data both efficiently and accurately. First, efficient use requires selecting key information from the many video frames to summarize the video. This paper therefore proposes a new key frame extraction method that summarizes video information efficiently and addresses the key frame clustering and data imbalance issues of previous methods. Second, accurate use requires learning the correct sign language expression features from a large dataset. To that end, we establish the Full-hand two-stream network, which focuses on the hand features that matter most in sign language expressions. The Full stream employs attention blocks to extract deep-level information and establishes temporal dependencies using a global attention module. The Hand stream uses hand attention to focus on hand-specific features. Our approach achieves state-of-the-art results on the CSL-500 dataset and competitive results on the LSA64 dataset. We also validate the effectiveness of the key frame extraction method and the two-stream network on the AUTSL dataset.

