Abstract

We study three aspects of designing appearance-based visual features for automatic lipreading: (a) the choice of the video region of interest (ROI) on which image-transform features are obtained; (b) the extraction of speech-discriminant features at each frame; and (c) the use of temporal information to improve visual speech modeling. With respect to (a), we propose an ROI that includes the speaker's jaw and cheeks, in addition to the traditionally used mouth/lip region. With respect to (b) and (c), we propose the use of a two-stage linear discriminant analysis, both within a single frame and across a large number of frames. On a large-vocabulary, continuous-speech, audio-visual database, the proposed visual features result in a 13% absolute reduction in visual-only word error rate over a baseline visual front end, and in an additional 28% relative improvement in audio-visual over audio-only phonetic classification accuracy.
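To make the feature cascade concrete, the sketch below outlines one plausible realization of the pipeline the abstract describes: an image transform over the ROI, a within-frame (intra-frame) LDA, temporal concatenation of neighboring frames, and a second across-frame (inter-frame) LDA. The 2-D DCT transform, the window length J, the retained-coefficient count, and the projection dimensionalities are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of a two-stage (intra-frame, then inter-frame) LDA cascade
# over appearance-based ROI features. All dimensions here are assumptions.
import numpy as np
from scipy.fft import dctn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def frame_features(roi, n_coeffs=24):
    """Image-transform features for one ROI frame: 2-D DCT, low-order coefficients."""
    coeffs = dctn(roi.astype(float), norm="ortho")
    return coeffs[:n_coeffs, :n_coeffs].ravel()

def stack_frames(feats, J=7):
    """Concatenate each frame with its J//2 neighbours on each side (temporal context)."""
    half = J // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + J].ravel() for i in range(len(feats))])

def train_two_stage_lda(rois, labels, n_intra=None, n_inter=None):
    """rois: (T, H, W) grayscale ROI frames; labels: (T,) per-frame speech classes.

    n_intra / n_inter must be at most min(n_classes - 1, n_features); None lets
    scikit-learn choose the maximum admissible projection dimension.
    """
    X = np.stack([frame_features(r) for r in rois])       # static per-frame features
    intra = LDA(n_components=n_intra).fit(X, labels)      # stage 1: within-frame LDA
    Y = stack_frames(intra.transform(X))                  # temporal concatenation
    inter = LDA(n_components=n_inter).fit(Y, labels)      # stage 2: across-frame LDA
    return intra, inter, inter.transform(Y)               # final dynamic visual features
```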
