Abstract

Demand has been growing for Automatic Speech Recognition (ASR) systems that operate robustly in acoustically noisy environments. This paper proposes a method for effectively integrating audio and visual information in audio-visual (bi-modal) ASR systems. Two issues are central to such integration: (1) synchronizing the audio and visual information, and (2) optimizing the system for its environment. Regarding (1), the audio and lip-movement features are correlated but exhibit a time lag relative to each other; to address this, we introduce an integration method based on HMM composition. Regarding (2), we examine whether the GPD algorithm can adaptively estimate the stream weights. Evaluation experiments show that the proposed method improves recognition accuracy for noisy speech.
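As a minimal sketch of the stream-weight idea the abstract refers to (not the paper's exact formulation), bi-modal HMM systems commonly combine the audio and visual state log-likelihoods with weights that trade off the two streams; the function name and default weight below are hypothetical.

```python
import numpy as np

def combined_log_likelihood(log_b_audio, log_b_visual, lambda_audio=0.7):
    """Stream-weighted combination of audio and visual HMM state
    log-likelihoods, as commonly used in bi-modal ASR.

    log_b_audio, log_b_visual: per-state log output probabilities.
    lambda_audio: audio stream weight; the visual weight is set so the
    two sum to 1 (a common convention, assumed here). In the paper's
    setting, such weights would be estimated adaptively, e.g. by GPD.
    """
    lambda_visual = 1.0 - lambda_audio
    return (lambda_audio * np.asarray(log_b_audio)
            + lambda_visual * np.asarray(log_b_visual))

# Example: combine per-state log-likelihoods for a 3-state HMM.
print(combined_log_likelihood([-2.0, -5.1, -3.3], [-1.2, -4.0, -6.5]))
```

In noisy conditions a lower audio weight lets the (noise-robust) visual stream dominate, which is the intuition behind tuning these weights to the environment.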
