Abstract

Robust multimodal identification systems based on audio–visual information have not yet been thoroughly investigated. The aim of this work is to propose a model-based feature extraction method that employs the physiological characteristics of the facial muscles producing lip movements. The approach draws on intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic lip model. Because these parameters depend exclusively on the neuro-muscular properties of the speaker, imitation of valid speakers can be reduced to a large extent. The parameters are applied to a Hidden Markov Model (HMM) audio–visual identification system, in which audio and video features are combined through a multistream pseudo-synchronized HMM training method. The proposed model is compared with other feature extraction methods, including Kalman filtering, neural networks, the adaptive network fuzzy inference system (ANFIS), and the autoregressive moving average (ARMA) model. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits together with a phonetically rich sentence. Combining Kalman filtering with the proposed model yields the best performance. The phonetic content of the spoken sentences is also analyzed to determine the phonetic combinations that lead to the best identification rate.
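As a minimal illustration of the kind of model-based parameters the abstract describes, the sketch below fits a second-order mass–viscosity–elasticity model to a lip-movement trajectory by least squares. The state variable, the assumed activation signal, and the function names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def estimate_muscle_params(x, u, dt):
    """Least-squares estimate of (mass m, viscosity b, elasticity k)
    from a displacement trajectory x driven by an input u, assuming
    the second-order model m*x'' + b*x' + k*x = u(t)."""
    v = np.gradient(x, dt)           # numerical velocity
    a = np.gradient(v, dt)           # numerical acceleration
    A = np.column_stack([a, v, x])   # regressors for m, b, k
    params, *_ = np.linalg.lstsq(A, u, rcond=None)
    return params                     # [m, b, k]

# Synthetic check: simulate a trajectory from known parameters,
# then verify that the fit approximately recovers them.
dt, n = 0.01, 500
t = np.arange(n) * dt
m_true, b_true, k_true = 1.0, 0.8, 5.0
u = np.sin(2 * np.pi * 1.5 * t)      # assumed muscle activation signal
x = np.zeros(n)
v = 0.0
for i in range(1, n):                 # simple Euler integration
    a = (u[i - 1] - b_true * v - k_true * x[i - 1]) / m_true
    v += a * dt
    x[i] = x[i - 1] + v * dt

print(estimate_muscle_params(x, u, dt))  # roughly [1.0, 0.8, 5.0]
```

In the paper's setting, such per-speaker dynamic parameters would serve as visual features feeding the multistream HMM alongside the audio stream.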
