Abstract

This paper explains how visual information from the lips can be combined with acoustic signals for speech segmentation. The psychological aspects of lip-reading and current automatic lip-reading systems are reviewed. The paper describes an image processing system that extracts the velocity of the lips from image sequences. Lip velocity is estimated by a combination of morphological image processing and block matching techniques, and the resulting velocity is used to locate syllable boundaries. This information is particularly useful when the speech signal is corrupted by noise. The paper also demonstrates the correlation between speech signals and lip information. Data fusion techniques are used to combine the acoustic and visual information for speech segmentation. The principal results show that combining visual and acoustic signals reduces segmentation errors by at least 10.4% when the signal-to-noise ratio is below 15 dB.
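The abstract mentions block matching as one component of the lip-velocity estimation. As a rough illustration of that idea (not the paper's actual implementation, whose matching criterion, block size, and morphological preprocessing are not given here), the following sketch estimates per-block displacement between two consecutive grayscale frames by exhaustively searching for the offset that minimizes the sum of absolute differences (SAD); dividing the displacement by the frame interval would give a velocity:

```python
import numpy as np

def block_match(prev, curr, block_size=8, search=4):
    """Estimate per-block displacement (dy, dx) between two grayscale
    frames via exhaustive block matching with a sum-of-absolute-differences
    criterion. Hypothetical sketch; parameter choices are illustrative."""
    h, w = prev.shape
    vectors = {}
    for y in range(0, h - block_size + 1, block_size):
        for x in range(0, w - block_size + 1, block_size):
            block = prev[y:y + block_size, x:x + block_size]
            best_sad, best_v = None, (0, 0)
            # Search a (2*search+1)^2 window around the block's position.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block_size > h or xx + block_size > w:
                        continue
                    cand = curr[yy:yy + block_size, xx:xx + block_size]
                    sad = np.abs(block.astype(int) - cand.astype(int)).sum()
                    if best_sad is None or sad < best_sad:
                        best_sad, best_v = sad, (dy, dx)
            vectors[(y, x)] = best_v
    return vectors
```

For example, if a bright patch in the mouth region shifts by two pixels vertically and one horizontally between frames, the block covering it is assigned the vector (2, 1). In the system described by the abstract, the magnitude of such vectors over the lip region would feed into syllable-boundary detection.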
