Abstract

This paper presents a state-of-the-art whole-word-state Dynamic Bayesian Network (DBN) model for audio-visual integration. Many DBN models have been proposed for speech recognition in recent years owing to their strong descriptive power and flexible structure. A DBN is a statistical model that represents collections of random variables as they evolve over time. However, a DBN with a whole-word-state structure does not support segmenting speech into subword units. In this study, a single-stream DBN (SDBN) model is proposed, and speech recognition and segmentation experiments are carried out on audio and visual speech respectively. To evaluate the proposed model, the timing boundaries of the segmented word syllables are compared with those obtained from well-trained tri-phone Hidden Markov Models (HMMs). In addition to word recognition results, word-syllable recognition rates and segmentation outputs are obtained from both the audio and the visual feature streams. Experimental results show that the SDBN model with a perceptual linear prediction (PLP) feature stream achieves a higher word recognition rate, 98.50%, than the tri-phone HMM model in a clean environment. Moreover, as noise in the audio stream increases, the SDBN model proves more robust.
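For readers unfamiliar with the formalism, a first-order DBN factorizes the joint distribution over hidden states $Q_{1:T}$ (here, whole-word states) and observations $O_{1:T}$ (audio or visual features) into per-time-slice terms; the tri-phone HMM used as a baseline is the special case with a single discrete state chain. The equation below is a generic illustration of this factorization, with notation chosen for exposition rather than taken from the paper:

$$P(Q_{1:T}, O_{1:T}) \;=\; P(Q_1)\,P(O_1 \mid Q_1)\prod_{t=2}^{T} P(Q_t \mid Q_{t-1})\,P(O_t \mid Q_t)$$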
