Abstract
Visual speech recognition refers to the identification of utterances through the movements of lips, tongue, teeth, and other facial muscles of the speaker without using the acoustic signal. This work shows the relative benefits of both static and dynamic visual speech features for improved visual speech recognition. Two approaches for visual feature extraction have been considered: (1) an image transform based static feature approach in which Discrete Cosine Transform (DCT) is applied to each video frame and 6×6 triangle region coefficients are considered as features. Principal Component Analysis (PCA) is applied over all 60 features corresponding to the video frame to reduce the redundancy; the resultant 21 coefficients are taken as the static visual features. (2) Motion segmentation based dynamic feature approach in which the facial movements are segmented from the video file using motion history images (MHI). DCT is applied to the MHI and triangle region coefficients are taken as the dynamic visual features. Two types of experiments were done one with concatenated features and another with dimension reduced feature by using PCA to identify the utterances. The left-right continuous HMMs are used as visual speech classifier to classify nine MPEG-4 standard viseme consonants. The experimental result shows that the concatenated as well as dimension reduced features improve te visual speech recognition with a high accuracy of 92.45% and 92.15% respectively.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have