Abstract

This paper studies the effect of combining evidence from multiple modes of speech on the recognition of different categories of sounds. Multimodal speech recognition systems are built by combining acoustic and visual cues from (lip-radiated) normal microphone speech, throat microphone speech, and lip reading for the recognition of the 145 highly confusable consonant-vowel units of the Hindi language. The performance of the multimodal systems is compared with that of the unimodal systems for the recognition of sounds based on their place of articulation (POA) and manner of articulation (MOA), as well as their associated vowels. This comparison shows that although the multimodal automatic speech recognition (ASR) systems rely on the presence of complementary speech-related acoustic and visual cues in the different modes, not all of the evidence is complementary. Bimodal systems that combine visual cues from lip reading are shown to improve the recognition of sounds based on POA and MOA, but to decrease the recognition of vowels. This study shows that, compared to the standard ASR system, the best multimodal system, which combines the two acoustic cues as well as the visual cue, improves the recognition of the POA category by 11%, the MOA category by 3%, and vowels by 2%. However, the study shows the need to explore better fusion techniques to overcome the absence of complementary evidence for certain categories of sounds, especially in bimodal systems.
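To make the idea of "combining evidence" concrete, below is a minimal sketch of decision-level (late) fusion, one common way to merge scores from several recognizers. The abstract does not state which fusion technique the paper uses, so the weighted log-likelihood combination, the modality names, and the weights and scores here are all illustrative assumptions, not the authors' method.

```python
# Minimal sketch of decision-level (late) fusion over three speech modes.
# NOTE: illustrative only; the paper's actual fusion scheme, weights, and
# class inventory are not given in the abstract.
import numpy as np

def late_fusion(scores_by_modality: dict[str, np.ndarray],
                weights: dict[str, float]) -> int:
    """Combine per-class log-likelihoods from several modalities.

    scores_by_modality: modality name -> log-likelihood vector over classes.
    weights: modality name -> fusion weight (typically summing to 1).
    Returns the index of the winning class after fusion.
    """
    # Weighted sum of the per-modality score vectors, then argmax decision.
    fused = sum(weights[m] * scores_by_modality[m] for m in scores_by_modality)
    return int(np.argmax(fused))

# Hypothetical scores over 4 consonant-vowel classes from the three modes
# named in the abstract: normal microphone, throat microphone, lip reading.
rng = np.random.default_rng(0)
scores = {
    "normal_mic": rng.normal(size=4),
    "throat_mic": rng.normal(size=4),
    "lip_video": rng.normal(size=4),
}
weights = {"normal_mic": 0.5, "throat_mic": 0.3, "lip_video": 0.2}
print("fused decision:", late_fusion(scores, weights))
```

Late fusion of this kind keeps each unimodal recognizer independent and combines only their output scores, which makes it easy to add or drop a modality; the weights determine how much a non-complementary modality (e.g., lip reading for vowels, per the abstract) can hurt the fused decision.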
