Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

Petar S Aleksic, Aggelos K Katsaggelos, Jay J Williams, Zhilin Wu

doi:10.1155/s1110865702206162

Abstract

We describe an audio-visual automatic continuous speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system uses facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust, automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs to reduce the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. Relative to audio-only speech recognition, the proposed system reduces the word error rate (WER) by 20% to 23% at various SNRs (0–30 dB) with additive white Gaussian noise, and by 19% under clean audio conditions.
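The PCA step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pca_project`, the use of power iteration with deflation as the eigen-solver, and the number of retained components `k` are all assumptions made for illustration, since the abstract does not specify them. The sketch centers the feature vectors, finds the leading principal axes of their covariance matrix, and returns the projection weights that would serve as reduced visual features.

```python
import math
import random


def pca_project(vectors, k, iters=200, seed=0):
    """Project feature vectors onto their top-k principal axes.

    Hypothetical helper: returns (weights, axes), where weights[i] holds the
    k projection weights of vectors[i] and axes holds the k principal axes.
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        # Power iteration (an assumed solver): repeated multiplication by cov
        # converges to the eigenvector with the largest remaining eigenvalue.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        axes.append(v)
        # Deflation: subtract the found component so the next pass
        # converges to the next principal axis.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    # Projection weights: dot product of each centered vector with each axis.
    weights = [[sum(c[j] * a[j] for j in range(d)) for a in axes]
               for c in centered]
    return weights, axes
```

A call such as `weights, axes = pca_project(fap_vectors, k)` would take one FAP vector per video frame (here `fap_vectors` is a hypothetical list of equal-length lists) and return `k` weights per frame. In practice a library eigen-decomposition or SVD routine would normally replace the hand-written power iteration.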

Highlights

Human listeners use visual information, such as facial expressions, and lips and tongue movement, in order to improve perception of the uttered audio signal [1].
Since mel-frequency cepstral coefficients (MFCCs) were obtained at a rate of 90 Hz, while facial animation parameters (FAPs) were obtained at a rate of 30 Hz, the FAPs were interpolated in order to obtain synchronized data.
We described a robust and automatic FAP extraction system that we have implemented, using a gradient vector flow (GVF) snake and parabolic templates.
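The rate mismatch noted in the highlights (FAPs at 30 Hz, MFCCs at 90 Hz) can be illustrated with a small sketch. The text here does not state which interpolation scheme was used, so linear interpolation is an assumption, as are the function name `upsample_linear` and the `factor` parameter. With `factor=3`, two new frames are inserted between each pair of consecutive 30 Hz frames, giving a 90 Hz sequence.

```python
def upsample_linear(frames, factor=3):
    """Linearly interpolate a sequence of feature vectors.

    Hypothetical helper: `frames` is a list of equal-length lists (one vector
    per visual frame). The result has (len(frames) - 1) * factor + 1 vectors,
    so a 30 Hz sequence becomes a 90 Hz sequence when factor is 3.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if len(frames) < 2:
        return [list(f) for f in frames]

    out = []
    for current, following in zip(frames, frames[1:]):
        for step in range(factor):
            t = step / factor  # fractional position between the two frames
            out.append([a + t * (b - a) for a, b in zip(current, following)])
    out.append(list(frames[-1]))  # keep the final original frame
    return out
```

Each upsampled visual vector can then be paired with the audio feature vector at the same time index, for example by concatenation before feeding a single-stream model.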

Summary

INTRODUCTION

Human listeners use visual information, such as facial expressions, and lips and tongue movement, in order to improve perception of the uttered audio signal [1]. The active contour method is a relatively new method for extracting visual features, and is very useful when it is hard to represent the shape of an object with a simple template [16, 21]. To the best of our knowledge, no results have previously been reported on the improvement of AVSR performance when FAPs are used as visual features with a relatively large vocabulary audio-visual database of about 1000 words. Reporting such results is the main objective of this paper.

THE AUDIO-VISUAL DATABASE

VISUAL FEATURE EXTRACTION

Facial animation parameters

Gradient vector flow snake

Tongue

Parabola templates

VISUAL FEATURE DIMENSIONALITY REDUCTION

AUDIO-VISUAL INTEGRATION

The single-stream HMM

The multistream HMM

SPEECH RECOGNITION EXPERIMENTS

Audio-only speech recognition experiments

Audio-visual speech recognition experiments

Experiments with single-stream HMMs

Experiments with multistream HMMs

Experiments with clean speech

Findings

CONCLUSIONS

Journal: EURASIP Journal on Advances in Signal Processing	Publication Date: Nov 28, 2002
Citations: 81	License type: cc-by
