Abstract

Audio-visual speech recognition (AVSR) has been shown to improve Automatic Speech Recognition (ASR) accuracy in noisy environments. In real-world scenarios, however, a speaker does not always face the camera; a visual speech recognition (VSR) system should therefore correctly recognize spoken content not only from frontal but also from non-frontal faces. In this paper, we introduce our efforts to build a new multi-angle audio-visual speech corpus, GAMVA. GAMVA consists of multi-angle lip images captured by 12 cameras, face feature points, and speech audio. So far, GAMVA contains 20 Japanese male subjects, each of whom uttered 25 types of Japanese daily-use greeting phrases. Baseline ASR, VSR, and AVSR systems based on our previous work were built to evaluate the GAMVA data. Experimental results show that these systems can recognize lip images at various angles and improve speech recognition accuracy in noisy environments.
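The corpus composition stated above (20 subjects, 25 utterances each, 12 camera angles) can be summarized in a short sketch. The constant names below are illustrative only, not an official GAMVA API, and the figures come directly from the abstract.

```python
# Hypothetical sketch of GAMVA's composition as stated in the abstract.
# Constant names are illustrative assumptions, not part of any released toolkit.
NUM_SUBJECTS = 20    # Japanese male speakers recorded so far
NUM_UTTERANCES = 25  # Japanese daily-use greeting phrases per speaker
NUM_CAMERAS = 12     # simultaneous viewing angles per utterance

# Each recording pairs multi-angle lip video with face feature points and audio,
# so the total number of lip-image sequences is the product of the three counts.
total_clips = NUM_SUBJECTS * NUM_UTTERANCES * NUM_CAMERAS
print(total_clips)  # 6000
```

This back-of-the-envelope count shows why a single recording session per speaker still yields a sizable multi-angle dataset: every utterance contributes 12 synchronized views.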
