Abstract

RGB sign language videos are easily affected by lighting, viewpoint, and clothing, so skeleton data can serve as an effective complement. The challenges of multi-modal sign language recognition lie in fusing the different modalities and extracting features from each of them. We therefore propose a continuous sign language recognition method based on interactive attention and an improved graph convolutional network (GCN). For the skeleton stream, a hand-shift decoupling GCN is proposed to improve graph modeling capability: it captures the interaction between the two hands by shifting the hand-node features and applying a decoupling operation. A cascaded attention module then extracts high-level semantic features progressively across the spatial, temporal, and channel dimensions. To make the best use of the two streams' features, an interactive attention mechanism is designed that takes the skeleton stream's semantic features as queries and the RGB features as keys and values, enhancing the network's ability to mine spatial-temporal correlations and improving recognition performance. The proposed method achieves competitive results on two public sign language datasets, CSL and RWTH-PHOENIX-Weather-2014.
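
The interactive attention mechanism described above is a form of cross-attention between the two streams. The following is a minimal sketch of that idea, assuming standard multi-head attention with skeleton features as queries and RGB features as keys and values; the feature dimension, head count, and residual/normalization choices are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class InteractiveAttention(nn.Module):
    """Cross-attention fusion: skeleton features attend to RGB features.

    Hypothetical sketch of the fusion idea from the abstract; layer sizes
    and the use of nn.MultiheadAttention are assumptions.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, skel: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
        # skel: (B, T, dim) skeleton-stream semantic features -> queries
        # rgb:  (B, T, dim) RGB-stream features -> keys and values
        fused, _ = self.attn(query=skel, key=rgb, value=rgb)
        # Residual connection keeps the skeleton semantics as the backbone
        return self.norm(skel + fused)


# Example: fuse per-frame features from the two streams
skel = torch.randn(2, 64, 512)  # batch of 2 clips, 64 frames each
rgb = torch.randn(2, 64, 512)
fused = InteractiveAttention()(skel, rgb)
print(fused.shape)  # torch.Size([2, 64, 512])
```

Using the skeleton stream as the query side reflects the abstract's design: the skeletons supply high-level semantic cues that select the relevant spatial-temporal content from the appearance-rich RGB features.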
