Multi-stream Asynchrony Modeling for Audio-Visual Speech Recognition

Guoyun Lv, Dongmei Jiang, Yunshu Hou, Rongchun Zhao

doi:10.1109/ism.2007.4412354

Open Access

Publication Date: Dec 1, 2007
Citations: 8
License type: cc-by-sa

Affiliation: Northwestern Polytechnical University

Abstract

This paper proposes two multi-stream asynchrony Dynamic Bayesian Network (DBN) models, the MS-ADBN model and the MM-ADBN model, for audio-visual speech recognition (AVSR). The two models have different topologies, but both loosen the asynchrony constraint between the audio and visual streams to the word level. In the MS-ADBN model, each word in both the audio stream and the visual stream is composed of its corresponding phones, and each phone is associated with an observation vector. The MM-ADBN model augments the MS-ADBN model: a level of hidden nodes, the state level, is added between the phone level and the observation node level to describe the dynamic process of phones. Essentially, the MS-ADBN model is a word model, while the MM-ADBN model is a phone model. Speech recognition experiments are carried out on a digit audio-visual (A-V) database and on a continuous A-V database. The results demonstrate that describing the asynchrony between the audio and visual streams is important for an AVSR system, and that the MM-ADBN model performs best on the task of continuous A-V speech recognition.
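
As an informal illustration of what asynchrony at the word level can mean, the sketch below assumes a hypothetical frame-level alignment in which each stream reports a (word index, phone index) pair per frame. Under this reading, the two streams may sit at different phones inside a word but must occupy the same word at every frame. The names `Frame`, `word_synchronous`, and `max_phone_lag` are illustrative assumptions and do not come from the paper; this is not the authors' DBN implementation.

```python
from typing import List, Tuple

# Hypothetical per-frame label: (word_index, phone_index_within_word).
Frame = Tuple[int, int]


def word_synchronous(audio: List[Frame], visual: List[Frame]) -> bool:
    """Return True if both streams are in the same word at every frame.

    Phone indices are allowed to differ, which models asynchrony
    inside a word; only the word index must agree.
    """
    if len(audio) != len(visual):
        return False
    return all(a[0] == v[0] for a, v in zip(audio, visual))


def max_phone_lag(audio: List[Frame], visual: List[Frame]) -> int:
    """Return the largest phone-index difference between the streams."""
    return max((abs(a[1] - v[1]) for a, v in zip(audio, visual)), default=0)


# Illustrative alignments (made-up data, six frames, two words).
audio_frames = [(0, 0), (0, 1), (0, 1), (0, 2), (1, 0), (1, 1)]
visual_ok = [(0, 0), (0, 0), (0, 1), (0, 2), (1, 0), (1, 0)]
visual_bad = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (1, 0)]
```

Here `visual_ok` lags the audio stream by at most one phone yet changes word on the same frame, so it satisfies the word-level constraint, whereas `visual_bad` enters the second word one frame early and does not.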

Full Text