Abstract
Sign language recognition (SLR) is an important and challenging research topic in the multimedia field. Conventional techniques for SLR rely on hand-crafted features and have achieved only limited success. In this paper, we present attention-based 3D convolutional neural networks (3D-CNNs) for SLR. The framework has two advantages: the 3D-CNNs learn spatio-temporal features directly from raw video without prior knowledge, and the attention mechanisms help select the most informative cues. When training the 3D-CNN to capture spatio-temporal features, spatial attention is incorporated into the network so that it focuses on the areas of interest. After feature extraction, temporal attention is applied to select the significant motions for classification. The proposed method is evaluated on two large-scale sign language datasets. The first, which we collected ourselves, is a Chinese sign language dataset consisting of 500 categories; the other is the ChaLearn14 benchmark. The experimental results demonstrate the effectiveness of our approach compared with state-of-the-art algorithms.
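To make the pipeline the abstract describes concrete, a minimal PyTorch sketch is given below: a 3D-CNN backbone, spatial attention over its feature maps, and temporal attention over per-frame features before classification. The module name (AttentionSLR), layer sizes, and pooling choices are illustrative assumptions for this sketch, not the authors' published architecture.

```python
# Sketch of an attention-based 3D-CNN for SLR, under assumed layer sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSLR(nn.Module):
    def __init__(self, num_classes=500, feat_dim=64):
        super().__init__()
        # 3D-CNN backbone: learns spatio-temporal features from raw clips.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Spatial attention: a 1x1x1 conv scores every spatial location.
        self.spatial_att = nn.Conv3d(feat_dim, 1, kernel_size=1)
        # Temporal attention: scores each time step of the pooled features.
        self.temporal_att = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W)
        f = self.backbone(x)                    # (B, C, T, H', W')
        b, c, t, h, w = f.shape
        # Spatial attention: softmax over the H'*W' locations per frame,
        # so the network focuses on areas of interest (e.g., the hands).
        s = self.spatial_att(f).view(b, t, h * w)
        s = F.softmax(s, dim=-1).view(b, 1, t, h, w)
        f = (f * s).sum(dim=(3, 4))             # (B, C, T) attended pooling
        # Temporal attention: weight each time step, then aggregate,
        # selecting the significant motions for classification.
        a = F.softmax(self.temporal_att(f.transpose(1, 2)), dim=1)  # (B, T, 1)
        v = (f.transpose(1, 2) * a).sum(dim=1)  # (B, C)
        return self.classifier(v)

# Usage: a batch of two 16-frame RGB clips at 112x112 resolution.
model = AttentionSLR()
logits = model(torch.randn(2, 3, 16, 112, 112))  # shape (2, 500)
```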