Abstract

Building a robust Automatic Speech Recognition (ASR) system and improving recognition accuracy in adverse conditions remain challenging tasks. One way to improve the robustness of an ASR system is to combine information from multiple sources (streams). The key contribution of this work is a multi-stream approach that handles the multiple inputs at the model level. Standard microphone (Sm), throat microphone (Tm), and lip reading (Lr) are the source streams used. This work explores a static weighted two-stream HMM (TSH) and a multi-stream HMM (MSH) model for bimodal and multimodal systems. Syllabic units of a Hindi language database, categorized as Vowel, Place of Articulation (POA), and Manner of Articulation (MOA), are used for training and testing. In this study, four types of TSH are proposed for the bimodal combinations ((Sm+Tm), (Tm+Lr), (Sm+Lr), (Lm+Lm)), and one type of MSH is proposed for the multimodal (Sm+Tm+Lr) system, in both synchronous and asynchronous manners. Mel Frequency Cepstral Coefficient (MFCC) features are used for the Sm and Tm signals. Combined pixel- and motion-based features (DCT/DWT-MHI) are used for the Lr signals. Of these two features, DWT outperforms DCT and is therefore used as the feature for visual speech. Experiments were conducted for both bimodal and multimodal systems. The proposed MSH approach shows improvements of 1.36%, 6.21%, and 5.8% in recognition accuracy for the Vowel, POA, and MOA categories, respectively, as compared to the bimodal systems.
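In a static weighted multi-stream HMM, model-level fusion amounts to raising each stream's emission likelihood to a fixed stream weight, i.e., summing weighted per-stream log-likelihoods at each state. The Python sketch below illustrates this combination for a single HMM state; the single-Gaussian state models, the 2-D feature vectors, and the weight values are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of static-weight stream fusion in a multi-stream HMM.
# Stream names follow the abstract (Sm, Tm, Lr); the Gaussian state
# models and weight values below are illustrative assumptions only.
import numpy as np
from scipy.stats import multivariate_normal

def combined_log_likelihood(observations, state_models, weights):
    """Combine per-stream emission log-likelihoods for one HMM state.

    observations : dict mapping stream name -> feature vector
    state_models : dict mapping stream name -> (mean, covariance)
    weights      : dict mapping stream name -> static stream weight
                   (assumed to sum to 1)
    """
    total = 0.0
    for stream, obs in observations.items():
        mean, cov = state_models[stream]
        # Weighted per-stream Gaussian log-likelihood: the stream weight
        # is the exponent in the product-of-likelihoods fusion rule.
        total += weights[stream] * multivariate_normal.logpdf(obs, mean, cov)
    return total

# Toy example: three streams with 2-D single-Gaussian state models.
rng = np.random.default_rng(0)
obs = {s: rng.normal(size=2) for s in ("Sm", "Tm", "Lr")}
models = {s: (np.zeros(2), np.eye(2)) for s in ("Sm", "Tm", "Lr")}
weights = {"Sm": 0.5, "Tm": 0.3, "Lr": 0.2}  # hypothetical static weights
print(combined_log_likelihood(obs, models, weights))
```

A bimodal TSH is the two-stream special case of the same rule; with weights (1, 0) or (0, 1) it degenerates to a single-stream HMM.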
