Independent information from visual features for multimodal speech recognition

Sabri Gurbuz, Eric Patterson, Zekeriya Tüfekçi, J. N. Gowdy

doi:10.1109/secon.2001.923119

Abstract

The performance of audio-based speech recognition systems degrades severely when background noise creates a mismatch between training and usage environments. This degradation results from a reduced ability to extract and distinguish important information from the audio features. One emerging technique for dealing with this problem is the addition of visual features in a multimodal recognition system. This paper presents an affine-invariant, multimodal speech recognition system and focuses on the additional information that is available from video features. Results are presented that demonstrate the distinct information available from the visual subsystem, which allows optimal joint decisions, based on the SNR and the type of noise, to outperform either the audio or the video subsystem alone in nearly all noisy environments.

Full Text