Abstract

Human-robot interaction (HRI) usually focuses on the interaction between hearing people and robots, ignoring the needs of deaf-mute people. Deaf-mute individuals use sign language to communicate their thoughts and emotions. Therefore, continuous sign language recognition (CSLR) can be introduced into the robot so that it can communicate with deaf-mute people. However, mainstream CSLR, which consists of two main modules, i.e., visual feature extraction and contextual modeling, has several problems. Visual features are usually extracted frame by frame and lack global contextual information, which adversely affects subsequent contextual modeling. In addition, we observe a substantial degree of redundancy in sign language data, which can significantly slow down model training and exacerbate overfitting. To solve these problems, in this paper we propose a novel vision transformer-based sign language recognition network combined with a key frame extraction (KFE) module for accurate end-to-end recognition of input video sequences. We conduct experiments on two CSLR benchmarks, TJUT-SLRT and USTC-CSL, and the results demonstrate the effectiveness of our method.
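To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of how key frame extraction, per-frame visual feature extraction, and contextual modeling could be chained for end-to-end gloss prediction. The module names, the cosine-similarity key-frame heuristic, the lightweight stand-in backbone, and all dimensions are illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch of a KFE + feature extraction + contextual modeling CSLR pipeline.
# All module names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeyFrameExtractor(nn.Module):
    """Drops near-duplicate frames to reduce redundancy in the input video."""

    def __init__(self, threshold: float = 0.95):
        super().__init__()
        self.threshold = threshold  # assumed cosine-similarity cutoff

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, C, H, W)
        flat = F.normalize(frames.flatten(1), dim=1)
        keep = [0]
        for t in range(1, frames.size(0)):
            # keep a frame only if it differs enough from the last kept frame
            if torch.dot(flat[t], flat[keep[-1]]) < self.threshold:
                keep.append(t)
        return frames[keep]


class CSLRNet(nn.Module):
    """Key frame extraction -> per-frame embedding -> Transformer context -> gloss logits."""

    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.kfe = KeyFrameExtractor()
        # stand-in visual backbone; the paper uses a vision transformer instead
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(dim, vocab_size)  # per-frame gloss logits (e.g. for CTC)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (T, C, H, W) for a single clip
        key_frames = self.kfe(video)
        feats = self.backbone(key_frames).unsqueeze(0)  # (1, T', dim)
        context = self.context(feats)                   # global contextual modeling
        return self.classifier(context)                 # (1, T', vocab_size)


if __name__ == "__main__":
    clip = torch.randn(32, 3, 112, 112)  # 32 dummy frames
    logits = CSLRNet(vocab_size=100)(clip)
    print(logits.shape)
```

In such a design, the KFE step trims redundant frames before feature extraction, which is the stated motivation for reducing training cost and overfitting, while the transformer encoder supplies the global context that frame-by-frame features alone lack.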
