Abstract

We outline a method for learning fuzzy rules for visual speech recognition. Such a system could be used for automatic annotation of video sequences to aid subsequent retrieval; it could also improve the recognition of voice commands when a system has no keyboard. In the implemented system, features were extracted automatically from short video sequences by identifying regions of the face and tracking the movement of various points around the mouth from frame to frame. The words in the video sequences were segmented manually on phoneme boundaries, and a rule base was constructed using two-dimensional fuzzy sets on feature and time parameters. The method was applied to the Tulips1 database, and results were slightly better than those obtained with techniques based on neural networks and Hidden Markov Models. This suggests that the learned rules are speaker independent. A medium-sized vocabulary of around 300 words, representative of phonemes in the English language, was created and used for training and testing. Reasonable accuracy for phoneme classification was achieved. Because of the ambiguity and similarity of various speech sounds, a scheme was developed to select a group of words when a test word was presented to the system. The accuracy achieved was 21-33%, comparable to expert human lip-readers, whose accuracy on nonsense words is about 30%.

