Abstract

We use the facial animation parameters (FAPs) defined by the MPEG-4 standard for the visual representation of speech to significantly improve automatic speech recognition (ASR). We describe a robust, automatic algorithm for extracting FAPs from visual data that requires no hand labeling or extensive training procedures. Multi-stream hidden Markov models (HMMs) integrate the audio and visual information. ASR experiments are performed under both clean and noisy audio conditions on a relatively large vocabulary of approximately 1000 words. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR at various SNRs with additive white Gaussian noise, and by 19% under clean audio conditions.
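As background, multi-stream HMMs typically fuse the audio and visual streams by scoring each with its own emission density and combining the state likelihoods through exponent weights; a standard formulation of this kind (a sketch of the general form, since the abstract does not specify the exact weighting scheme used) is

b_j(\mathbf{o}_t) = \big[\, b_{jA}(\mathbf{o}_{At}) \,\big]^{\lambda_A} \, \big[\, b_{jV}(\mathbf{o}_{Vt}) \,\big]^{\lambda_V}, \qquad \lambda_A + \lambda_V = 1,

where b_{jA} and b_{jV} are the audio- and visual-stream observation densities of state j, \mathbf{o}_{At} and \mathbf{o}_{Vt} are the acoustic and FAP feature vectors at time t, and the stream weights \lambda_A, \lambda_V can be tuned to the acoustic conditions, e.g., shifting weight toward the visual stream as the SNR drops.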
