Abstract

Lipreading is the task of understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known as ‘visemes’, although they are not yet formally defined. In this article, we describe a structured approach for creating speaker-dependent visemes with a fixed number of visemes within each set. We create viseme sets of sizes two to 45. Each set is based upon clustering phonemes, so each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading, and demonstrates that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes which are better still. Second, we present a novel two-pass training scheme for phoneme classifiers. In the first pass, this approach uses the new intermediary visual units from our first experiment as classifiers; we then use the phoneme-to-viseme maps to retrain these into phoneme classifiers. This method significantly improves on previous lipreading results with the RMAV speakers.
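
As a concrete illustration of the clustering step, the sketch below builds a phoneme-to-viseme map for any requested set size by agglomeratively clustering a phoneme confusion matrix. This is a minimal sketch under our own assumptions, not the paper's published procedure: the function name `phoneme_to_viseme_map`, the toy phoneme list, and the randomly generated confusion counts are hypothetical stand-ins for a real confusion matrix from a visual recognizer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def phoneme_to_viseme_map(confusion, phonemes, n_visemes):
    """Cluster phonemes into n_visemes classes from a confusion matrix.

    confusion[i, j] counts how often phoneme i was recognised as phoneme j;
    phonemes that are often confused end up in the same viseme class.
    """
    # Symmetrise and normalise the counts into a similarity in [0, 1],
    # then invert to a distance: confusable pairs lie close together.
    sim = confusion + confusion.T
    sim = sim / sim.max()
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # Average-linkage agglomerative clustering, cut at exactly n_visemes.
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_visemes, criterion="maxclust")
    return {p: f"v{lab:02d}" for p, lab in zip(phonemes, labels)}

# One mapping per set size, mirroring the paper's range of two to 45.
phonemes = ["p", "b", "m", "f", "v", "k", "g"]   # toy subset
rng = np.random.default_rng(0)
confusion = rng.integers(0, 20, size=(7, 7)).astype(float)
maps = {n: phoneme_to_viseme_map(confusion, phonemes, n) for n in range(2, 6)}
print(maps[3])
```

Each cut of the dendrogram yields a different-sized viseme set, which is what makes a controlled sweep from two to 45 units possible.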

Highlights

  • The concept of phonemes is well developed in speech recognition and derives from a definition in phonetics as “the smallest sound one can articulate” [1]

  • Phonemes are not only used by linguists and audiologists to describe speech; they are also widely used in large-vocabulary speech recognition as the acoustic classes, or ‘units’, to be recognized [2,3,4]

  • There is an emerging body of work [23,24] that, despite the caveats above, is demonstrating that phoneme lipreading systems can outperform viseme recognizers. In essence it is a tradeoff: does one use viseme units, which are tuned to the shape of the lips but suffer from inaccuracies caused by visual confusions between words that sound different but look identical [23]; or does one stick to phonetic units, knowing that many phonemes are difficult to distinguish on the lips?


Summary

Introduction

The concept of phonemes is well developed in speech recognition and derives from a definition in phonetics as “the smallest sound one can articulate” [1]. There is an emerging body of work [23,24] that, despite the caveats above, is demonstrating that phoneme lipreading systems can outperform viseme recognizers. In essence it is a tradeoff: does one use viseme units, which are tuned to the shape of the lips but suffer from inaccuracies caused by visual confusions between words that sound different but look identical [23]; or does one stick to phonetic units, knowing that many phonemes are difficult to distinguish on the lips? As we shall show in this paper, it need not be an either/or choice between phonemes and visemes; we develop a novel method that allows us to vary the number of classes, or visual units. This means we can tune the visual units as an intermediary state between the visual and audio spaces, and optimize against the competing trends of homopheneity [27,28] and accuracy [29]. Our contributions are: a method for finding optimal visual units; a review of language model units for lipreading systems; and a new training paradigm for lipreading systems.
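
The new training paradigm can likewise be sketched in miniature. The sketch below is our own hedged reading of the two-pass scheme, with a toy map and hypothetical function names (`to_viseme_transcript` and `seed_phoneme_models` are ours, not the paper's code): pass one relabels phoneme transcripts as visemes so the intermediary visual units can be trained; pass two seeds one classifier per phoneme from its viseme's model before re-estimation on phoneme-labelled data.

```python
# Toy phoneme-to-viseme map; real maps come from the clustering step and
# cover the full phoneme inventory (these values are hypothetical).
p2v = {"p": "v01", "b": "v01", "m": "v01", "f": "v02", "v": "v02"}

def to_viseme_transcript(phones, p2v):
    """Pass one: relabel a phoneme transcript at the viseme level."""
    return [p2v[p] for p in phones]

def seed_phoneme_models(viseme_models, p2v):
    """Pass two: initialise one model per phoneme from its viseme's model.

    viseme_models maps viseme labels to trained parameters; each copy
    would then be re-estimated on phoneme-labelled training data.
    """
    return {p: dict(viseme_models[v]) for p, v in p2v.items()}

viseme_models = {"v01": {"mean": 0.1}, "v02": {"mean": 0.7}}
print(to_viseme_transcript(["b", "f", "m"], p2v))    # ['v01', 'v02', 'v01']
print(seed_phoneme_models(viseme_models, p2v)["b"])  # {'mean': 0.1}
```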

Background
Finding a Robust Range of Intermediate Visual Units
Clustering Phonemes
Linear Predictor Tracking
Active Appearance Model Features
Step One
Step Two
Step Three
Optimal Visual Unit Set Sizes
Discussion
Hierarchical Training for Weak-Learned Visual Units
Classifier Adaptation Training
Phoneme HMM Training
Language Network Units
Findings
Effects of Training Visual Units for Phoneme Classifiers
Conclusions