Abstract

Automatic speech recognition is of great importance in human-machine interfaces. Despite extensive effort over decades, acoustic-based recognition systems remain too inaccurate for the vast majority of real applications, especially those in noisy environments such as crowded settings. The use of visual features in audio-visual speech recognition is motivated by the mechanism of speech production and by the natural ability of human listeners to reduce audio ambiguity using visual cues. Moreover, visual information provides complementary cues that cannot be corrupted by the acoustic noise of the environment. However, problems such as the selection of an optimal set of visual features and of optimal models for audio-visual integration remain challenging research topics. In recent years, the most common model fusion methods for audio-visual speech recognition have been Multi-stream Hidden Markov Models (MSHMMs), such as the product HMM and the coupled HMM. In these models, audio and visual features are fed into two or more parallel HMMs with different topologies. These MSHMMs describe the correlation between audio and visual speech to some extent and allow asynchrony within speech units. Compared with the single-stream HMM, system performance is improved, especially in noisy environments. At the same time, problems remain due to an inherent limitation of the HMM structure: at certain nodes, such as phones, syllables, or words, constraints are imposed that limit the asynchrony between the audio and visual streams to the phone (or syllable, word) level. Since phones are the basic modeling units for large-vocabulary continuous speech recognition, the audio and visual streams are forced to synchronize at phone boundaries, which is inconsistent with the fact that visual activity often precedes the audio signal by as much as 120 ms. Beyond improving word recognition rates in noisy environments, the task of segmenting audio-visual speech units (such as phones or visemes) also requires a speech model that better describes the inherent correlation and asynchrony between audio and visual speech.
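
To make the multi-stream fusion idea concrete, the following minimal Python sketch shows the standard state-synchronous MSHMM scoring rule, in which the emission likelihood of a state is a weighted product of the per-stream likelihoods, log b_j(o_a, o_v) = lambda_a * log b_j(o_a) + lambda_v * log b_j(o_v). The function names, Gaussian emission densities, and stream weights here are illustrative assumptions, not the specific model of this work:

import numpy as np
from scipy.stats import multivariate_normal

def fused_log_likelihood(obs_audio, obs_visual,
                         mean_a, cov_a, mean_v, cov_v,
                         lambda_a=0.7, lambda_v=0.3):
    """Stream-weighted emission log-likelihood of one MSHMM state.

    Combines the audio and visual stream likelihoods as
    lambda_a * log b(o_a) + lambda_v * log b(o_v),
    where the exponents reflect the relative reliability of each stream.
    """
    log_b_a = multivariate_normal.logpdf(obs_audio, mean=mean_a, cov=cov_a)
    log_b_v = multivariate_normal.logpdf(obs_visual, mean=mean_v, cov=cov_v)
    return lambda_a * log_b_a + lambda_v * log_b_v

# Toy usage with 2-D audio and visual feature vectors:
o_a = np.array([0.2, -0.1])
o_v = np.array([0.05, 0.3])
print(fused_log_likelihood(o_a, o_v,
                           mean_a=np.zeros(2), cov_a=np.eye(2),
                           mean_v=np.zeros(2), cov_v=np.eye(2)))

In practice the stream weights are commonly shifted toward the visual stream as the acoustic signal-to-noise ratio drops, which is one reason such fusion helps in noisy conditions.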
