Abstract

Audio-based automatic speech recognition (A-ASR) systems are affected by noisy conditions in real-world applications. Adding visual cues to the ASR system is an appealing alternative to improve the robustness of the system, replicating the audiovisual perception process used during human interactions. A common problem observed when using audiovisual automatic speech recognition (AV-ASR) is the drop in performance when the speech is clean. In this case, visual features may not provide complementary information, introducing variability that negatively affects the performance of the system. The experimental evaluation in this study clearly demonstrates this problem when we train a state-of-the-art audiovisual hybrid system with a deep neural network (DNN) and hidden Markov models (HMMs). This study proposes a framework that addresses this problem, improving, or at least maintaining, the performance when visual features are used. The proposed approach is a deep learning solution with a gating layer that diminishes the effect of noisy or uninformative visual features, keeping only useful information. The framework is evaluated on a subset of the audiovisual CRSS-4ENGLISH-14 corpus, which consists of 61 h of speech from 105 subjects collected simultaneously with multiple cameras and microphones. The proposed framework is compared with conventional HMMs whose observation models are implemented with either Gaussian mixture models (GMMs) or DNNs. We also compare the proposed system with a multi-stream HMM system. The experimental evaluation indicates that the proposed framework outperforms the alternative methods under all configurations, demonstrating the robustness of the gating-based framework for AV-ASR.
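The central component of the abstract is the gating layer that suppresses uninformative visual features before audiovisual fusion. The sketch below is a minimal illustration of one way such a gate could look in PyTorch; the feature dimensions, the element-wise sigmoid gate computed from both modalities, and the single fusion layer are assumptions for illustration only, not the exact architecture described in the paper.

```python
# Illustrative gating layer for audiovisual fusion (hypothetical dimensions,
# not the paper's exact architecture).
import torch
import torch.nn as nn


class GatedAVFusion(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=30, hidden_dim=128):
        super().__init__()
        # The gate is predicted from both modalities and applied to the visual stream.
        self.gate = nn.Linear(audio_dim + visual_dim, visual_dim)
        self.fusion = nn.Linear(audio_dim + visual_dim, hidden_dim)

    def forward(self, audio_feat, visual_feat):
        joint = torch.cat([audio_feat, visual_feat], dim=-1)
        # Sigmoid gate in [0, 1]: values near 0 suppress noisy or
        # uninformative visual dimensions, values near 1 pass them through.
        g = torch.sigmoid(self.gate(joint))
        gated_visual = g * visual_feat
        fused = torch.relu(self.fusion(torch.cat([audio_feat, gated_visual], dim=-1)))
        # In a hybrid DNN-HMM system, this representation would feed further
        # layers that estimate HMM state posteriors.
        return fused


# Example usage with a batch of 8 frames (dimensions are illustrative).
if __name__ == "__main__":
    model = GatedAVFusion()
    audio = torch.randn(8, 40)
    visual = torch.randn(8, 30)
    print(model(audio, visual).shape)  # torch.Size([8, 128])
```

When the gate saturates near zero, the system effectively falls back to an audio-only model, which is the behavior the abstract targets for clean-speech conditions.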
