Abstract

There are two basic problems in sign language recognition (SLR): (a) isolated word SLR and (b) continuous SLR. Most existing continuous SLR methods are extensions of isolated word SLR methods: they use an isolated word SLR module as the basic building block and obtain sentence-level results through sentence segmentation and word alignment. However, sentence segmentation and word alignment are often inaccurate, resulting in low sentence recognition accuracy. At the same time, continuous SLR usually requires strict sample labels, which makes manual labeling difficult and limits the amount of available training data. To address these challenges, this paper proposes a bidirectional spatial–temporal LSTM fusion attention network (Bi-ST-LSTM-A) for continuous SLR. This approach avoids problems such as sentence segmentation, word alignment, and tedious manual labeling. Our contributions are summarized as follows: (1) we propose a sign language video feature representation method that combines a convolutional neural network (CNN) with spatial–temporal LSTM (ST-LSTM) information fusion; and (2) we construct a unified neural machine translation framework that can be used for complex continuous SLR and gesture recognition that is not restricted to specific signers or environments. Experiments were carried out on three large continuous sign language datasets. The recognition accuracy reached 81.22% on the 500 CSL dataset, 76.12% on the RWTH-PHOENIX-Weather dataset, and 75.32% on the RWTH-PHOENIX-Weather-2014T dataset, illustrating the effectiveness of the proposed framework.
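For illustration, the sketch below is a minimal PyTorch rendering of the encoder-decoder structure the abstract describes: per-frame CNN features are fed to a bidirectional LSTM over time, and an attention-based LSTM decoder emits sign-gloss tokens. The small convolutional backbone, the collapse of the spatial–temporal (ST-LSTM) fusion into a single bidirectional LSTM layer, and all layer sizes, the vocabulary size, and the tensor shapes are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of a CNN + bidirectional LSTM encoder with an attention-based
# decoder for continuous SLR. Dimensions and the backbone are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameEncoder(nn.Module):
    """CNN features per frame, then a bidirectional LSTM over the frame sequence."""

    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Stand-in backbone: a tiny conv stack producing feat_dim per frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        enc, _ = self.lstm(feats)              # (B, T, 2*hidden)
        return enc


class AttentionDecoder(nn.Module):
    """LSTM decoder with a learned attention score over encoder time steps."""

    def __init__(self, vocab_size, enc_dim=512, hidden=256, emb=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.attn = nn.Linear(enc_dim + hidden, 1)
        self.cell = nn.LSTMCell(emb + enc_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, enc, targets):           # enc: (B, T, enc_dim); targets: (B, L)
        b = enc.size(0)
        h = enc.new_zeros(b, self.cell.hidden_size)
        c = enc.new_zeros(b, self.cell.hidden_size)
        logits = []
        for step in range(targets.size(1)):
            # Attention weights over encoder states, conditioned on the decoder state h.
            query = h.unsqueeze(1).expand(-1, enc.size(1), -1)
            scores = self.attn(torch.cat([enc, query], dim=-1))
            context = (F.softmax(scores, dim=1) * enc).sum(dim=1)
            h, c = self.cell(torch.cat([self.embed(targets[:, step]), context], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)       # (B, L, vocab_size)


if __name__ == "__main__":
    encoder = FrameEncoder()
    decoder = AttentionDecoder(vocab_size=1000)
    video = torch.randn(2, 16, 3, 112, 112)     # 2 clips, 16 frames each
    glosses = torch.randint(0, 1000, (2, 5))    # teacher-forced gloss tokens
    print(decoder(encoder(video), glosses).shape)   # torch.Size([2, 5, 1000])

In this sketch the decoder is teacher-forced with ground-truth gloss tokens; at inference time one would instead feed back the previous prediction and stop at an end-of-sequence token.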

Highlights

  • The goal of video-based sign language recognition (SLR) is to convert a video sequence into a sign language text representation [1]–[4]

  • To achieve the SLR task on RGB video data, and inspired by the significant results of deep learning in object detection, we propose a sign language detection and representation framework for RGB video

  • In this paper, a continuous SLR framework based on an ST-LSTM fusion attention network is proposed


Summary

Introduction

The goal of video-based SLR is to convert a video sequence into a sign language text representation [1]–[4]. SLR, and continuous SLR in particular [1], [4], is a relatively new field of human–computer interaction (HCI). Although many researchers have explored this area [8], [9], many challenges and open problems remain. A key challenge of SLR is the design of visual descriptors that capture sign language semantics, such as facial expressions and the shape, direction, and position of the hands [1], [3]. Most existing sign language video sequences are recorded with ordinary cameras that lack depth sensors, which limits the practical application of existing SLR methods.

