Abstract

In this paper, a neural network-based lip reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide vocabulary and to recognise words that may not have been included in system training. The system has been evaluated on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with state-of-the-art works in lip reading sentences, the system achieves significantly improved performance, with a 15% lower word error rate. In addition, experiments with videos of varying illumination have shown that the proposed model is robust to varying levels of lighting. The main contributions of this paper are: 1) the classification of visemes in continuous speech using a specially designed transformer with a unique topology; 2) the use of visemes as a classification schema for lip reading sentences; and 3) the conversion of visemes to words using perplexity analysis. All the contributions serve to enhance the accuracy of lip reading sentences. The paper also provides an essential survey of the research area.
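To illustrate the third contribution, the sketch below shows one plausible form of viseme-to-word conversion via perplexity analysis. The phoneme-to-viseme mapping, the toy lexicon, and the unigram language model probabilities are all illustrative assumptions, not the paper's actual schema or model: several phonemes map to a single viseme, so a recognised viseme sequence is ambiguous, and the candidate word with the lowest perplexity under a language model is selected.

```python
import math

# Hypothetical many-to-one phoneme-to-viseme mapping (illustrative only;
# the paper's actual viseme schema is not reproduced here).
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",          # bilabials share one viseme
    "t": "V3", "d": "V3",                      # alveolars
    "ae": "V4",                                # open vowel
}

# Toy lexicon: word -> assumed phoneme sequence.
LEXICON = {
    "bat": ["b", "ae", "t"],
    "pat": ["p", "ae", "t"],
    "mad": ["m", "ae", "d"],
}

# Toy unigram language model (assumed probabilities).
UNIGRAM = {"bat": 0.2, "pat": 0.5, "mad": 0.3}

def word_to_visemes(word):
    """Map a lexicon word's phonemes to its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in LEXICON[word]]

def candidates(viseme_seq):
    """All lexicon words whose viseme sequence matches the recognised one."""
    return [w for w in LEXICON if word_to_visemes(w) == viseme_seq]

def perplexity(words):
    """Perplexity of a word sequence under the toy unigram model."""
    log_prob = sum(math.log(UNIGRAM[w]) for w in words)
    return math.exp(-log_prob / len(words))

# A recognised viseme sequence is ambiguous: V1 covers p, b and m,
# and V3 covers t and d, so several words fit the same visemes.
cands = candidates(["V1", "V4", "V3"])
best = min(cands, key=lambda w: perplexity([w]))
```

Here all three toy words share the viseme sequence `["V1", "V4", "V3"]`, and the most probable word under the language model ("pat") is chosen; in practice the paper applies this idea at sentence level, where surrounding context makes the perplexity comparison far more discriminative than in this single-word sketch.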

Highlights

  • The task of automated lip reading has attracted a lot of research attention in recent years and many breakthroughs have been made in the area with a variety of machine learning-based approaches having been implemented [1] [2]

  • This paper focuses on improving the accuracy of lip reading sentences. This is achieved by using visemes as a very limited set of classes for classification, a specially designed deep learning model with its own network topology for classifying visemes, and the conversion of recognised visemes to candidate words using perplexity analysis

  • The viseme classifier was trained for a total of 2000 epochs; the model was evaluated once the validation loss had saturated and no further convergence was recorded


Summary

Introduction

The task of automated lip reading has attracted a lot of research attention in recent years, and many breakthroughs have been made in the area with a variety of machine learning-based approaches [1] [2]. The most recent approaches are deep learning-based and largely focus on decoding long speech segments in the form of words and sentences, using either words or ASCII characters as the classes to recognise [5] [6] [7] [8] [9] [10]. Sentence-level lip reading has not yet attained accuracies as good as those of word-based approaches. It remains a challenging task to automatically lip read people uttering sentences that cover a wide range of vocabulary and contain words that may not have appeared in the training phase, while using the fewest classes possible.

