Abstract
Among the various methods proposed to improve the accuracy and robustness of automatic speech recognition (ASR), the use of additional knowledge sources has proved successful. In particular, a recent approach supplements the acoustic information with visual data, mostly derived from the speaker's lip shape. Perceptual studies support this approach by emphasising the importance of visual information for speech recognition in humans. This paper describes a method we have developed for the adaptive integration of acoustic and visual information in ASR. Each modality contributes to the recognition process with its own weight, which is dynamically adapted during recognition, mainly according to the signal-to-noise ratio provided as a contextual input. We tested this method on continuous hidden Markov model-based systems developed according to direct identification (DI), separate identification (SI) and hybrid identification (DI + SI) strategies. Experiments performed under various noise-level conditions show that the DI + SI-based system is the most promising one when compared to both the DI- and SI-based systems for a speaker-dependent continuous French letter-spelling recognition task. They also confirm that using adaptive modality weights instead of fixed weights improves performance, and that weight estimation could benefit from using visemes as decision units for the visual recogniser in the SI- and DI + SI-based systems.
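As a rough illustration of the adaptive weighting idea summarised above, the sketch below shows a typical separate-identification-style (late) fusion in which per-modality HMM log-likelihoods are combined with an SNR-dependent weight. It is not the paper's exact formulation: the linear mapping from SNR to weight, the clipping range, and all function names are illustrative assumptions.

```python
import math


def visual_weight(snr_db, snr_clean=30.0, snr_noisy=0.0):
    """Map the current signal-to-noise ratio (dB) to a visual-modality weight.

    Illustrative assumption: the weight moves linearly from 0 (clean audio,
    trust the acoustic stream) to 1 (very noisy audio, trust the visual
    stream), clipped to the range [0, 1].
    """
    w = (snr_clean - snr_db) / (snr_clean - snr_noisy)
    return min(1.0, max(0.0, w))


def fused_log_likelihood(log_p_acoustic, log_p_visual, snr_db):
    """Weighted combination of acoustic and visual log-likelihoods
    for one candidate unit (e.g. a letter hypothesis)."""
    gamma = visual_weight(snr_db)
    return (1.0 - gamma) * log_p_acoustic + gamma * log_p_visual


# Example: with moderately noisy audio (10 dB SNR) the visual stream
# contributes about two thirds of the combined score.
print(fused_log_likelihood(math.log(0.2), math.log(0.4), snr_db=10.0))
```

In this reading, the DI strategy would instead fuse the acoustic and visual feature vectors before a single recogniser, while DI + SI combines both levels; the snippet only illustrates the decision-level weighting that the adaptive SNR-driven weights act on.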