Abstract

The brain-like functionality of artificial neural networks, together with their strong performance in many areas of scientific application, makes them a reliable tool for Audio-Visual Speech Recognition (AVSR) systems. In AVSR systems, such networks are applied at stages ranging from the preliminary feature extraction to the higher levels of information fusion and speech modeling. In this paper, carefully designed deep autoencoders are proposed to produce efficient bimodal features from the audio and visual input streams. The basic proposed structure is then modified in three successive steps to make better use of the visual information extracted from the speakers' lip Region of Interest (ROI). The performance of the proposed structures is compared to both unimodal and bimodal baselines on a phoneme recognition task under different noisy audio conditions, using a state-of-the-art DNN-HMM hybrid as the speech classifier. Compared to MFCC audio-only features, the final proposed bimodal features yield an average relative reduction in Phoneme Error Rate (PER) of 36.9% across a range of noisy conditions, and a relative reduction of 19.2% in the clean condition.
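
For illustration, below is a minimal sketch of the kind of bimodal deep autoencoder the abstract describes: audio features (e.g. MFCCs) and visual lip-ROI features are encoded into a shared bottleneck that can serve as the bimodal feature for a DNN-HMM classifier. The layer sizes, input dimensions, and training loop are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a bimodal deep autoencoder (assumed architecture, not the paper's exact one).
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=39, visual_dim=50, bottleneck_dim=64):
        super().__init__()
        # Modality-specific encoders feed a shared bottleneck layer.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, 128), nn.ReLU())
        self.shared_enc = nn.Sequential(nn.Linear(256, bottleneck_dim), nn.ReLU())
        # The decoder reconstructs both input streams from the bottleneck code.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, audio_dim + visual_dim),
        )

    def forward(self, audio, visual):
        h = torch.cat([self.audio_enc(audio), self.visual_enc(visual)], dim=-1)
        code = self.shared_enc(h)   # bimodal feature vector for the classifier
        recon = self.decoder(code)
        return code, recon

# Toy training step on random data (hypothetical dimensions):
model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio = torch.randn(32, 39)    # batch of MFCC frames
visual = torch.randn(32, 50)   # batch of lip-ROI feature vectors
code, recon = model(audio, visual)
loss = nn.functional.mse_loss(recon, torch.cat([audio, visual], dim=-1))
opt.zero_grad()
loss.backward()
opt.step()
```

In this sketch, the bottleneck activations (`code`) would be extracted after training and used as the bimodal input features to the recognizer.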
