Abstract
In recent years, there has been growing interest in using visual information for automatic lipreading (Kaynak, Zhi, Cheok, Sengupta, Jian, & Chung, 2004) and visual speaker authentication (Mok, Lau, Leung, Wang, & Yan, 2004). It has been shown that visual cues, such as lip shape and lip movement, can greatly improve the performance of these systems. Various techniques have been proposed over the past decades to extract speech- and speaker-relevant information from lip image sequences. One approach is to extract the lip contour from lip image sequences. This generally involves lip region segmentation and lip contour modeling (Liew, Leung, & Lau, 2002; Wang, Lau, Leung, & Liew, 2004), and the performance of visual speech recognition and visual speaker authentication systems depends heavily on the accuracy and efficiency of these two procedures. Lip region segmentation aims to label each pixel in the lip image as lip or non-lip. The accuracy and robustness of the lip segmentation process are of vital importance for subsequent lip extraction. However, large variations caused by different speakers, lighting conditions, or makeup make the task difficult. The low color contrast between lip and facial skin and the presence of facial hair further complicate the problem. Given a correctly segmented lip region, the lip extraction process then involves fitting a lip model to the lip region. A good lip model should be compact, that is, have a small number of parameters, and should adequately represent most valid lip shapes while rejecting most invalid shapes. As most lip extraction techniques involve iterative model fitting, the efficiency of the optimization process is another important issue.