Abstract

This manuscript introduces the end-to-end embedding of a CNN into an HMM, while interpreting the outputs of the CNN in a Bayesian framework. The hybrid CNN-HMM combines the strong discriminative abilities of CNNs with the sequence modelling capabilities of HMMs. Most current approaches in the field of gesture and sign language recognition disregard the necessity of dealing with sequence data in both training and evaluation. With the presented end-to-end embedding we improve over the state of the art on three challenging benchmark continuous sign language recognition tasks, with relative reductions in word error rate of 15% to 38% and absolute reductions of up to 20%. We analyse the effect of the CNN structure, network pretraining and the number of hidden states. We compare the hybrid modelling to a tandem approach and evaluate the gain of model combination.
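
In the hybrid approach, Bayes' rule converts the CNN's framewise softmax posteriors p(s|x) into scaled likelihoods p(x|s) ∝ p(s|x) / p(s), which replace the usual HMM emission probabilities. The following NumPy sketch illustrates this reinterpretation; the function name and the prior_scale exponent are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def scaled_log_likelihoods(posteriors, state_priors, prior_scale=1.0):
        """Turn framewise CNN posteriors p(s|x) into scaled log-likelihoods
        log p(x|s) ~ log p(s|x) - prior_scale * log p(s), usable as HMM
        emission scores (Bayes' rule, dropping the state-independent p(x))."""
        eps = 1e-12  # numerical floor to avoid log(0)
        return np.log(posteriors + eps) - prior_scale * np.log(state_priors + eps)

    # Toy usage: 4 frames, 3 HMM states.
    posteriors = np.array([[0.7, 0.2, 0.1],
                           [0.6, 0.3, 0.1],
                           [0.2, 0.5, 0.3],
                           [0.1, 0.3, 0.6]])
    priors = np.array([0.5, 0.3, 0.2])  # state priors, e.g. counted from a training alignment
    emission_scores = scaled_log_likelihoods(posteriors, priors)

Dividing by the prior counteracts the bias towards frequent states; a prior_scale below 1.0 softens that correction, a common heuristic in hybrid systems.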

Highlights

  • We show that different training iterations provide complementary classifiers, which can further boost recognition when employed as ensembles of hybrid Convolutional Neural Network (CNN)-HMMs (see the sketch after this list)

  • With AlexNet we see 30% relative improvement on PHOENIX 2012, 8% on PHOENIX 2014 and 20% on SIGNUM, while with GoogLeNet we see 13% relative improvement on PHOENIX 2012, over 10% on PHOENIX 2014 and again 20% on SIGNUM
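
The combination rule behind these ensembles is not specified in this excerpt; a common choice in hybrid systems is a frame-level log-linear combination of the members' emission scores before HMM decoding. The sketch below illustrates that option in NumPy; the function name and the uniform weighting are assumptions for illustration only.

    import numpy as np

    def combine_ensemble_scores(score_list, weights=None):
        """Weighted frame-level combination of log emission scores from
        several hybrid CNN-HMM members, e.g. snapshots taken at different
        training iterations. Each array has shape (n_frames, n_states)."""
        scores = np.stack(score_list)  # -> (n_models, n_frames, n_states)
        if weights is None:
            weights = np.full(len(score_list), 1.0 / len(score_list))
        return np.tensordot(weights, scores, axes=1)  # -> (n_frames, n_states)

    # Toy usage: two ensemble members, 2 frames, 3 states.
    m1 = np.log(np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]))
    m2 = np.log(np.array([[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]))
    combined = combine_ensemble_scores([m1, m2])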

Introduction

Face-to-face communication is often the preferred choice when either important matters need to be discussed or informal links between individuals are established. The task of gesture recognition, however, is not precisely defined, which makes it difficult to compare algorithms and approaches. Sign language, on the other hand, provides a clear framework with a defined inventory and grammatical rules that govern joint expression by the hands (movement, shape, orientation, place of articulation) and the face (eye gaze, eyebrows, mouth, head orientation). This makes sign languages, the natural languages of the deaf, a perfect test bed for computer vision and human language research.
