Abstract
Lip reading provides useful information for speech perception and language understanding, especially when the auditory speech is degraded. However, many current automatic lip reading systems impose restrictions on their users. In this paper, we present our research efforts in the Interactive Systems Laboratory towards unrestricted lip reading. We first introduce a top-down approach to automatically track and extract lip regions. This technique makes it possible to acquire visual information in real time without limiting the user's freedom of movement. We then discuss normalization algorithms to preprocess images under different lighting conditions (global illumination and side illumination). We also compare different visual preprocessing methods, such as the raw image, Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA). We demonstrate the feasibility of the proposed methods through the development of a modular system for flexible human–computer interaction via both visual and acoustic speech. The system is based on an extension of an existing state-of-the-art speech recognition system, a modular Multi-State Time Delay Neural Network (MS-TDNN). We have developed adaptive combination methods at several different levels of the recognition network. The system automatically tracks a speaker and extracts his/her lip region in real time. The system has been evaluated under different noise conditions such as white noise, music, and mechanical noise. The experimental results indicate that the system achieves up to 55% error reduction by using visual information in addition to the acoustic signal.
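The abstract mentions PCA as one of the compared visual preprocessing options. The following is a minimal sketch, not the authors' implementation, of how extracted lip-region frames could be projected onto their top principal components to obtain low-dimensional visual features; the image size (24x16) and number of components (32) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def pca_features(lip_images, n_components=32):
    """Project grayscale lip-region frames onto their top principal components.

    lip_images: array of shape (n_frames, height, width).
    Returns an array of shape (n_frames, n_components).
    """
    n_frames = lip_images.shape[0]
    X = lip_images.reshape(n_frames, -1).astype(np.float64)  # flatten each frame
    Xc = X - X.mean(axis=0)                                   # center the data
    # Principal axes via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                            # top principal axes
    return Xc @ components.T                                  # low-dimensional features

# Example: 100 frames of a 24x16 lip region reduced to 32-D feature vectors.
frames = np.random.rand(100, 16, 24)
features = pca_features(frames)
print(features.shape)  # (100, 32)
```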