Abstract

Lipreading is the task of understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known as ‘visemes’, although they are not yet formally defined. In this article, we describe a structured approach for creating speaker-dependent visemes with a fixed number of visemes within each set. We create viseme sets of sizes two to 45. Each set is based upon clustering phonemes, so each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading, and demonstrates that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes which are better still. Second, we present a novel two-pass training scheme for phoneme classifiers. In the first pass, this approach uses the new intermediary visual units from our first experiment as classifiers; we then use the phoneme-to-viseme maps to retrain these into phoneme classifiers. This method significantly improves on previous lipreading results with the RMAV speakers.
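
As a concrete illustration of the clustering step, the sketch below builds a phoneme-to-viseme map for any requested set size by agglomeratively clustering a phoneme confusion matrix. This is a minimal sketch under our own assumptions, not the paper's published procedure: the function name `phoneme_to_viseme_map`, the toy phoneme list, and the randomly generated confusion counts are hypothetical stand-ins for a real confusion matrix from a visual recognizer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def phoneme_to_viseme_map(confusion, phonemes, n_visemes):
    """Cluster phonemes into n_visemes classes from a confusion matrix.

    confusion[i, j] counts how often phoneme i was recognised as phoneme j;
    phonemes that are often confused end up in the same viseme class.
    """
    # Symmetrise and normalise the counts into a similarity in [0, 1],
    # then invert to a distance: confusable pairs lie close together.
    sim = confusion + confusion.T
    sim = sim / sim.max()
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # Average-linkage agglomerative clustering, cut at exactly n_visemes.
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_visemes, criterion="maxclust")
    return {p: f"v{lab:02d}" for p, lab in zip(phonemes, labels)}

# One mapping per set size, mirroring the paper's range of two to 45.
phonemes = ["p", "b", "m", "f", "v", "k", "g"]   # toy subset
rng = np.random.default_rng(0)
confusion = rng.integers(0, 20, size=(7, 7)).astype(float)
maps = {n: phoneme_to_viseme_map(confusion, phonemes, n) for n in range(2, 6)}
print(maps[3])
```

Each cut of the dendrogram yields a different-sized viseme set, which is what makes a controlled sweep from two to 45 units possible.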

Highlights

  • The concept of phonemes is well developed in speech recognition and derives from a definition in phonetics as “the smallest sound one can articulate” [1]

  • Phonemes are not only used by linguists and audiologists to describe speech; they are also widely used in large-vocabulary speech recognition as the acoustic classes, or ‘units’, to be recognized [2,3,4]

  • There is an emerging body of work [23,24] that, despite the caveats above, is demonstrating that phoneme lipreading systems can outperform viseme recognizers. In essence it is a tradeoff: does one use viseme units, which are tuned to the shape of the lips but suffer from inaccuracies caused by visual confusions between words that sound different but look identical [23]; or does one stick to phonetic units, knowing that many phonemes are difficult to distinguish on the lips?


Summary

Introduction

The concept of phonemes is well developed in speech recognition and derives from a definition in phonetics as “the smallest sound one can articulate” [1]. There is an emerging body of work [23,24] that, despite the caveats above, is demonstrating that phoneme lipreading systems can outperform viseme recognizers. In essence it is a tradeoff: does one use viseme units, which are tuned to the shape of the lips but suffer from inaccuracies caused by visual confusions between words that sound different but look identical [23]; or does one stick to phonetic units, knowing that many phonemes are difficult to distinguish on the lips? As we shall show in this paper, it need not be an either/or choice between phonemes and visemes; we develop a novel method that allows us to vary the number of classes, or visual units. This means we can tune the visual units as an intermediary state between the visual and audio spaces, and optimize against the competing trends of homopheneity [27,28] and accuracy [29]. Our contributions are: a method for finding optimal visual units; a review of language model units for lipreading systems; and a new training paradigm for lipreading systems.
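
The new training paradigm can likewise be sketched in miniature. The sketch below is our own hedged reading of the two-pass scheme, with a toy map and hypothetical function names (`to_viseme_transcript` and `seed_phoneme_models` are ours, not the paper's code): pass one relabels phoneme transcripts as visemes so the intermediary visual units can be trained; pass two seeds one classifier per phoneme from its viseme's model before re-estimation on phoneme-labelled data.

```python
# Toy phoneme-to-viseme map; real maps come from the clustering step and
# cover the full phoneme inventory (these values are hypothetical).
p2v = {"p": "v01", "b": "v01", "m": "v01", "f": "v02", "v": "v02"}

def to_viseme_transcript(phones, p2v):
    """Pass one: relabel a phoneme transcript at the viseme level."""
    return [p2v[p] for p in phones]

def seed_phoneme_models(viseme_models, p2v):
    """Pass two: initialise one model per phoneme from its viseme's model.

    viseme_models maps viseme labels to trained parameters; each copy
    would then be re-estimated on phoneme-labelled training data.
    """
    return {p: dict(viseme_models[v]) for p, v in p2v.items()}

viseme_models = {"v01": {"mean": 0.1}, "v02": {"mean": 0.7}}
print(to_viseme_transcript(["b", "f", "m"], p2v))    # ['v01', 'v02', 'v01']
print(seed_phoneme_models(viseme_models, p2v)["b"])  # {'mean': 0.1}
```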

Background
Finding a Robust Range of Intermediate Visual Units
Clustering Phonemes
Linear Predictor Tracking
Active Appearance Model Features
Step One
Step Two
Step Three
Optimal Visual Unit Set Sizes
Discussion
Hierarchical Training for Weak-Learned Visual Units
Classifier Adaptation Training
Phoneme HMM Training
Language Network Units
Findings
Effects of Training Visual Units for Phoneme Classifiers
Conclusions