Abstract

Audiovisual fusion remains one of the most challenging problems in audiovisual automatic speech recognition (AV-ASR) and continues to attract substantial research interest. Over the last few decades, many approaches for integrating the audio and video modalities have been proposed to improve the performance of automatic speech recognition in both clean and noisy conditions. However, very few studies in the literature compare different fusion models for AV-ASR, and even fewer compare audiovisual fusion models for large vocabulary continuous speech recognition (LVCSR) using deep neural networks (DNNs). This paper reviews and compares the performance of five audiovisual fusion models: the feature fusion model, the decision fusion model, the multistream hidden Markov model (HMM), the coupled HMM, and the turbo decoder. A complete evaluation of these fusion models is conducted with a standard speaker-independent DNN-based LVCSR Kaldi recipe in three experimental setups: clean-train/clean-test, clean-train/noisy-test, and matched training. All experiments are carried out on the recently released NTCD-TIMIT audiovisual corpus, whose task is phone recognition in continuous speech. Using NTCD-TIMIT, with its freely available visual features and 37 clean and noisy acoustic signals, allows this study to serve as a common benchmark against which novel LVCSR AV-ASR models and approaches can be compared.
