Abstract

Among the various methods proposed to improve the accuracy and robustness of automatic speech recognition (ASR), one successful approach is the use of additional knowledge sources. In particular, a recent method proposes supplementing the acoustic information with visual data, mostly derived from the speaker's lip shape. Perceptual studies support this approach by emphasising the importance of visual information for speech recognition in humans. This paper describes a method we have developed for adaptive integration of acoustic and visual information in ASR. Each modality is involved in the recognition process with a different weight, which is dynamically adapted during this process mainly according to the signal-to-noise ratio provided as a contextual input. We tested this method on continuous hidden Markov model-based systems developed according to direct identification (DI), separate identification (SI) and hybrid identification (DI + SI) strategies. Experiments performed under various noise-level conditions show that the DI + SI based system is the most promising one when compared to both DI and SI-based systems for a speaker-dependent recognition task of continuously spelled French letters. They also confirm that using adaptive modality weights instead of fixed weights improves performance, and that weight estimation could benefit from using visemes as decision units for the visual recogniser in SI and DI + SI based systems.
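As a rough illustration of the adaptive weighting idea summarised above, the sketch below combines acoustic and visual stream log-likelihoods with a weight driven by the signal-to-noise ratio. The sigmoid mapping from SNR to the acoustic weight, its parameters, and the function names are assumptions for illustration only; they do not reproduce the paper's estimation scheme or the DI/SI/DI + SI architectures.

    # Illustrative sketch (not the paper's exact formulation): SNR-driven
    # weighting of acoustic and visual stream log-likelihoods during
    # audio-visual HMM decoding. The sigmoid mapping and its parameters
    # (midpoint, slope) are hypothetical, chosen only to show the idea.
    import math

    def acoustic_weight(snr_db, midpoint=10.0, slope=0.3):
        """Map the signal-to-noise ratio (dB) to an acoustic weight in [0, 1].
        Higher SNR -> trust the acoustic stream more."""
        return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint)))

    def combined_log_likelihood(log_p_acoustic, log_p_visual, snr_db):
        """Weighted combination of per-frame (or per-state) stream scores:
        log P = gamma * log P_a + (1 - gamma) * log P_v,
        with gamma adapted to the current SNR (contextual input)."""
        gamma = acoustic_weight(snr_db)
        return gamma * log_p_acoustic + (1.0 - gamma) * log_p_visual

    # Example: in clean speech the acoustic stream dominates; in noise the
    # visual (lip-shape) stream contributes more to the combined score.
    print(combined_log_likelihood(-12.3, -15.8, snr_db=25.0))  # mostly acoustic
    print(combined_log_likelihood(-12.3, -15.8, snr_db=0.0))   # mostly visual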
