Abstract

This paper presents a phonetic and visemic information-based audio–visual speech recognizer (AVSR). An active appearance model (AAM) is used to extract the visual features, as it compactly represents the shape and appearance information of the jaw and lip region. Combining visual features with traditional acoustic features has proved promising in complex auditory environments. However, most existing AVSR systems rarely address problems in the visual domain. In this work, a real-world, multiple-camera corpus, audio-visual in-car (AVICAR), is used for the speech recognition experiments. The sentence portion of the Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpus is used to study the performance of the bimodal audio–visual speech recognizer. To account for the "McGurk" effect, the acoustic and visual models are trained on phonetic and visemic information, respectively. The phonetic–visemic AVSR system shows a significant improvement over the phonetic-only AVSR system.
