Abstract

The quality of recognition systems for continuous utterances in signed languages has advanced considerably in recent years. However, research efforts often do not address specific linguistic features of signed languages, such as non-manual expressions. In this work, we evaluate the potential of a recognition system based on a single video camera with respect to the latter. To this end, we introduce a two-stage pipeline based on two-dimensional body joint positions extracted from RGB camera data. The system first separates the data stream of a signed expression into meaningful word segments using a frame-wise binary Random Forest. Each segment is then transformed into an image-like representation and classified with a Convolutional Neural Network. The proposed system is evaluated on a data set of continuous sentence expressions in Japanese Sign Language with varying non-manual expressions. Exploring multiple variations of data representations and network parameters, we are able to distinguish word segments with specific non-manual intonations from the underlying body joint movement data with 86% accuracy. Full sentence predictions achieve a total Word Error Rate of 15.75%. This marks an improvement of 13.22% compared to predictions based on ground truth labels that are insensitive to non-manual content. Consequently, our analysis constitutes an important contribution toward a better understanding of mixed manual and non-manual content in signed communication.
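As a rough illustration of the two-stage pipeline described above, the sketch below mirrors its flow under several assumptions: the library choices (scikit-learn, SciPy, PyTorch), the 32x32 target resolution, and all function and class names are illustrative and do not reflect the authors' implementation.

```python
# Minimal sketch of the two-stage pipeline described in the abstract, assuming
# 2D body joint positions of shape (n_frames, n_joints * 2) as input.
# Library choices, the 32x32 target resolution, and all names are illustrative
# assumptions, not the authors' implementation.
from scipy.ndimage import zoom
from sklearn.ensemble import RandomForestClassifier
import torch.nn as nn


def segment_frames(train_joints, train_labels, new_joints):
    """Stage 1: a frame-wise binary Random Forest marks frames that belong to
    a word segment; consecutive positive frames are grouped into segments."""
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train_joints, train_labels)        # train_labels: 1 = inside a word
    is_word = rf.predict(new_joints)
    segments, start = [], None
    for t, flag in enumerate(is_word):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append(new_joints[start:t])
            start = None
    if start is not None:
        segments.append(new_joints[start:])
    return segments


def to_image(segment, size=32):
    """Stage 2 preprocessing: rescale a (frames x features) segment into a
    fixed-size, image-like 2D array so a CNN can classify it."""
    h, w = segment.shape
    return zoom(segment, (size / h, size / w), order=1)


class WordCNN(nn.Module):
    """Stage 2: a small CNN over the image-like segment; n_classes would
    include the non-manual variants distinguished in the paper."""

    def __init__(self, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 8 * 8, n_classes),
        )

    def forward(self, x):                     # x: (batch, 1, 32, 32)
        return self.net(x)
```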

Highlights

  • Systems that process and understand expressions in Sign Language (SL) have great potential to facilitate the daily lives of individuals who are deaf or hard of hearing

  • For all Convolutional Neural Network (CNN) combinations, the average Word Error Rate (WER) corresponds to the performance of the class-based word recognition; a minimal sketch of the WER computation follows this list

  • We implemented a novel staged system for Continuous Sign Language Recognition (CSLR), whose ability to understand complex linguistic content was evaluated on a set of signed video sequences in Japanese Sign Language
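
The Word Error Rate mentioned above and in the abstract is the standard edit-distance metric over word sequences. The minimal sketch below shows how it is typically computed; the function name and the example sentences are illustrative assumptions, not taken from the paper's data set.

```python
# Hedged sketch of the standard Word Error Rate: the edit distance
# (substitutions + insertions + deletions) between the hypothesis and the
# reference word sequence, divided by the reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)


# Hypothetical example: one substitution in a four-word sentence gives WER = 0.25.
print(word_error_rate("I GO SCHOOL TOMORROW", "I GO HOME TOMORROW"))
```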


Introduction

Systems that process and understand expressions in Sign Language (SL) have great potential to facilitate the daily lives of individuals who are deaf or hard of hearing. To date, no universal system has been found that is accurate, reliable, and applicable to general daily use. This is due to a number of complexities specific to SLs. First, SLs are visual languages and impose specific sensing requirements to obtain meaningful representations of the moving joint trajectories through time and space.

A number of learning systems utilizing both staged and combined strategies to address the problem of Continuous Sign Language Recognition (CSLR) from video data have been reported. Combined systems mainly evolved with the technological possibility of end-to-end learning. They aim to unify the two problems of temporal segmentation and word classification into one model architecture in order to prevent error accumulation caused by imperfect temporal segmentation. The first deep network, proposed by Koller et al. [16,17], utilized a CNN as a feature extractor for the classification of hand shapes on top of a segmentation step based on a Hidden Markov Model.
