Abstract

This contribution describes a method for automatic lip reading from the video picture. The results of this automatic method are used for subsequent audio-visual speech processing and recognition. A simple image-processing method for finding the human face in the video picture is presented. Within the detected face, the lips are located in the region of interest using a mathematical gradient method based on the image histogram, which is computed from the colour values of the region of interest. The first results for visual speech recognition of isolated words are presented in the conclusion.

1. Introduction

The method described here was used for face and lip detection to support speech recognition. Today’s computer technology makes it possible to use also the visual part of the data for speech processing and recognition. The visual part of speech does not carry as much information about what has been said as the audio part of the speech signal, but in combination with the audio signal it can improve the recognition rate in noisy conditions (in a noisy factory, in the street, in a railway station, and so on). We want the recognition response to be very short, which means that speech recognition has to run in real time. Image processing and recognition is very time-consuming compared to audio speech processing and recognition, so we need fast and simple methods that are nevertheless sufficiently robust. The parameters obtained from the detected outline of the lips are used most frequently in the visual part of speech recognition. Finding the lips is done in two steps. The first step is to find the human face in the video picture; here we want to know whether a human face is present in the video picture and whether somebody is speaking to the camera.
The camera is connected to a computer on which the program for audio-visual speech recognition runs. The second step is to separate the region of interest containing the human lips from the detected face. This is a very difficult task because the skin colour of the face can be the same as the colour of the lips. Methods based on colour or shape segmentation of an image containing the human lips (or face) fail in certain specific cases. For the present we concentrate on normal cases of lip finding and reading (normal pressure and temperature, a person who is not sick, and so on). The static and dynamic (from the video stream) visual parameters derived from the detected outline of the lips are then used, in combination with parameters from the audio speech signal, for audio-visual speech recognition.

2. Finding of the Face of the Human Subject in the Video Picture

Face detection is the first task for quality lip reading. Many image-processing methods exist that solve this problem [1]. They include face-detection methods that work with a skin-colour model, the shape of the face, and so on. The methods based on a skin-colour model use simple threshold segmentation or segmentation using Gaussian mixture models of the skin pixels. Two-mixture models for people with the “bright” or “dark” colour of facial skin are used most frequently [2]. We have used a segmentation method based on a one-mixture Gaussian model of human facial skin. At present our method is used only for people with the “bright” colour of facial skin; we would like to create a two-mixture Gaussian model for all people in the near future. Combined with shape segmentation, this method is fast and satisfactory for robust face detection. The Cr, Cb colour transformation (1) of the RGB colour components of the original colour picture was used to create the one-mixture Gaussian model of human facial skin.
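The one-mixture Gaussian skin model described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper's transformation (1) is not reproduced in this excerpt, so the standard ITU-R BT.601 coefficients for the RGB to (Cb, Cr) conversion are an assumption, and all names here (`rgb_to_cbcr`, `SkinGaussian`, the likelihood threshold) are hypothetical.

```python
import numpy as np

def rgb_to_cbcr(rgb):
    """Convert RGB values (last axis = 3) to the (Cb, Cr) chrominance
    plane. ITU-R BT.601 coefficients are assumed; the paper's own
    equation (1) is not available in this excerpt."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([cb, cr], axis=-1)

class SkinGaussian:
    """One-mixture (single) Gaussian skin model in the (Cb, Cr) plane."""

    def fit(self, skin_pixels_rgb):
        # Estimate mean and covariance from training skin pixels.
        x = rgb_to_cbcr(np.asarray(skin_pixels_rgb)).reshape(-1, 2)
        self.mean = x.mean(axis=0)
        self.inv_cov = np.linalg.inv(np.cov(x, rowvar=False))
        return self

    def likelihood(self, rgb):
        # Unnormalized Gaussian likelihood via squared Mahalanobis distance.
        d = rgb_to_cbcr(np.asarray(rgb)) - self.mean
        m2 = np.einsum("...i,ij,...j->...", d, self.inv_cov, d)
        return np.exp(-0.5 * m2)

    def segment(self, rgb, threshold=0.1):
        # Binary skin mask: pixels whose likelihood exceeds the threshold.
        return self.likelihood(rgb) >= threshold
```

A pixel is labelled skin when its likelihood under the fitted Gaussian exceeds a (hypothetical) threshold. Working only in the Cb/Cr chrominance components discards the luminance, which is a common reason this colour space is chosen for skin models: the segmentation becomes largely insensitive to lighting brightness.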
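The abstract states that, within the face, the lips are found by a gradient method based on the colour histogram of the region of interest, but the exact procedure is not given in this excerpt. The following is therefore only a hedged sketch of one plausible reading: compute the histogram of a colour channel over the region of interest, and place the segmentation threshold at the steepest falling edge (most negative gradient) of the smoothed histogram, assuming lip pixels form a smaller cluster above the dominant skin cluster. The function names and the single-channel choice are hypothetical.

```python
import numpy as np

def lip_threshold_from_histogram(roi_channel, bins=256):
    """Hypothetical sketch: choose a lip/skin threshold at the steepest
    falling edge of the colour histogram of the region of interest."""
    hist, edges = np.histogram(roi_channel, bins=bins, range=(0, 256))
    # Smooth the histogram so the gradient is not dominated by noise.
    smoothed = np.convolve(hist, np.ones(5) / 5.0, mode="same")
    # The first difference approximates the gradient of the histogram.
    grad = np.diff(smoothed)
    # The steepest descent marks the falling edge of the dominant
    # (skin) cluster; values above it are treated as lip colour.
    return edges[int(np.argmin(grad))]

def segment_lips(roi_channel):
    # Binary lip mask (sketch only): pixels above the threshold.
    return roi_channel > lip_threshold_from_histogram(roi_channel)
```

In practice such a threshold would be applied to the colour values of the lip region of interest cut from the detected face, and the resulting mask's outline would supply the visual speech parameters mentioned above.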
