Abstract
Speech carries more information than text, but in noisy environments it suffers from the disadvantage that it may not be decoded properly, by humans and machines alike. Because speech is bimodal, augmenting audio features with visual features, specifically those related to lip movements, can improve the degree of speech recognition. The objective of this work is to use audio and visual features jointly to aid word recognition. We extract MFCC features from the audio and geometric features from the lip movements, and use them together in a machine learning algorithm to predict word utterances. Videos of word utterances are extracted from the TIMIT database. Statistical summaries of the audio features and the corresponding visual features from lip movements form the input feature vector to the machine learning algorithm (a multi-layer perceptron). The experimental results show a word recognition accuracy of 91% using the MLP, while the accuracy attained with a KNN classifier is 61%. The results presented here have important implications for applications in human-machine interaction (HMI) and for aiding the hearing impaired.
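As a rough illustration of the pipeline summarized above, the sketch below builds a fused audio-visual feature vector and trains an MLP word classifier. It is not the authors' implementation: the library choices (librosa for MFCCs, scikit-learn for the MLP), the number of coefficients, the lip-landmark input format, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the audio-visual word recognition pipeline described in
# the abstract. Libraries (librosa, scikit-learn) and all parameter values
# are assumptions for illustration, not the paper's actual setup.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def audio_features(wav_path):
    """Summarize an utterance's MFCCs with per-coefficient mean and std."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients (assumed)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def lip_features(mouth_landmarks):
    """Geometric lip features per frame: mouth width, height, aspect ratio.

    `mouth_landmarks` is an (n_frames, n_points, 2) array of (x, y) lip
    contour points from a face-landmark detector (assumed input format).
    """
    width = np.ptp(mouth_landmarks[:, :, 0], axis=1)   # horizontal opening
    height = np.ptp(mouth_landmarks[:, :, 1], axis=1)  # vertical opening
    ratio = height / np.maximum(width, 1e-6)
    per_frame = np.stack([width, height, ratio], axis=1)
    # Statistical summary over frames, mirroring the audio features.
    return np.concatenate([per_frame.mean(axis=0), per_frame.std(axis=0)])

def train_word_classifier(dataset, labels):
    """dataset: list of (wav_path, mouth_landmarks) pairs per utterance."""
    X = np.stack([np.concatenate([audio_features(w), lip_features(m)])
                  for w, m in dataset])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    clf.fit(X_tr, y_tr)
    print("word accuracy:", clf.score(X_te, y_te))
    return clf
```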