Abstract

Audio-visual speech recognition (AVSR) utilizes both the audio and video modalities for robust automatic speech recognition. Deep neural network (DNN) models have achieved promising performance in AVSR owing to their generalized, nonlinear mapping ability. However, these DNN models have two main disadvantages: 1) most models tackle the AVSR problem while neglecting the fact that frames are correlated; and 2) the features learned by these models are not reliable, because the joint representation learned by the fusion stage fails to consider category-specific information, and the discriminative information is sparse while the noise, reverberation, irrelevant image objects, and background are redundant. To relieve these disadvantages, we propose the auxiliary-loss multimodal GRU (alm-GRU) model, which consists of three parts: feature extraction, data augmentation, and fusion & recognition. Feature extraction and data augmentation form a complete, effective solution for processing raw video and for training, and are the precondition for the later core part, fusion & recognition, which uses an alm-GRU equipped with a novel loss: an end-to-end network combining both fusion and recognition while considering modal and temporal information. Experiments on benchmark data sets show the superiority of our model and the necessity of the data augmentation and generative components.
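
To make the fusion & recognition idea concrete, here is a minimal sketch of a multimodal GRU with an auxiliary loss in PyTorch. The layer sizes, the concatenation-based fusion, the class names, and the weighting `alpha` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a multimodal GRU with an auxiliary loss (PyTorch).
# The fusion scheme (per-frame concatenation), the head layout, and the
# loss weighting `alpha` are assumptions for illustration only.
import torch
import torch.nn as nn

class AlmGRUSketch(nn.Module):  # hypothetical name
    def __init__(self, audio_dim=40, visual_dim=256, hidden=128, n_classes=10):
        super().__init__()
        # One GRU per modality captures the temporal correlation between frames.
        self.audio_gru = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_gru = nn.GRU(visual_dim, hidden, batch_first=True)
        # A fusion GRU runs over the concatenated per-frame modality states.
        self.fusion_gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)    # main classifier
        self.aux_a = nn.Linear(hidden, n_classes)  # audio-only auxiliary head
        self.aux_v = nn.Linear(hidden, n_classes)  # visual-only auxiliary head

    def forward(self, audio, visual):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim)
        ha, _ = self.audio_gru(audio)
        hv, _ = self.visual_gru(visual)
        fused, _ = self.fusion_gru(torch.cat([ha, hv], dim=-1))
        # Classify from the last time step of each stream.
        return self.cls(fused[:, -1]), self.aux_a(ha[:, -1]), self.aux_v(hv[:, -1])

def alm_loss(logits, aux_a, aux_v, target, alpha=0.3):
    # Main cross-entropy plus weighted per-modality auxiliary terms, so the
    # joint representation stays discriminative for each category.
    ce = nn.functional.cross_entropy
    return ce(logits, target) + alpha * (ce(aux_a, target) + ce(aux_v, target))
```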

Highlights

  • Robust automatic speech recognition underpins understanding and communication between humans and computers

  • As GAN research surged in 2017, various GANs were designed, including conditional generative adversarial nets [33], deep convolutional generative adversarial networks (DCGAN) [34], dual GAN [35], and Wasserstein GAN [36]. The generative component used in our model estimates the distribution of the region around the lips and of the spectrogram (see the generator sketch after this list)

  • As in the audio-visual speech recognition (AVSR) experiments with the multimodal deep belief network (MDBN), the fused features from the multimodal deep autoencoder (MDAE) are mean-pooled to obtain the representation of a single video, and an SVM serves as the recognition part (see the pipeline sketch after this list)
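
The second bullet mentions a generative component that estimates the distribution of lip regions and spectrograms. Below is a minimal DCGAN-style generator sketch for lip-region augmentation; the latent dimension, channel counts, and 64x64 grayscale output are assumptions for illustration, and the paper's component may differ.

```python
# Minimal DCGAN-style generator sketch for augmenting lip-region images
# (PyTorch). Latent size, channel counts, and the 64x64 output are
# illustrative assumptions; a spectrogram generator would be analogous.
import torch
import torch.nn as nn

class LipGenerator(nn.Module):  # hypothetical name
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            # (z_dim, 1, 1) -> (ch*8, 4, 4)
            nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(True),
            # -> (1, 64, 64) grayscale lip crop in [-1, 1]
            nn.ConvTranspose2d(ch, 1, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

# Usage: sample synthetic lip crops for augmentation.
# fake = LipGenerator()(torch.randn(8, 100))  # (8, 1, 64, 64)
```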
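
The MDAE baseline pipeline in the last bullet (fused per-frame features, mean pooling over frames, an SVM for recognition) can be sketched as follows; the feature shapes and the RBF kernel are assumptions.

```python
# Sketch of the baseline recognition pipeline described above: per-frame
# fused features are mean-pooled into one vector per video, then fed to
# an SVM. Feature shapes and the RBF kernel are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def video_representation(frame_features):
    # frame_features: (num_frames, feature_dim) fused features for one video.
    return frame_features.mean(axis=0)  # mean pooling over frames

def train_recognizer(videos, labels):
    # videos: list of (num_frames_i, feature_dim) arrays; labels: class ids.
    X = np.stack([video_representation(v) for v in videos])
    clf = SVC(kernel="rbf")  # the recognition part
    clf.fit(X, labels)
    return clf
```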

Summary

INTRODUCTION

Robust automatic speech recognition underpins understanding and communication between humans and computers. The reasons why the alm-GRU can relieve the problems above are: first, the alm-GRU captures temporal information together with the correlation between frames; second, the alm-GRU equipped with a novel loss is an end-to-end network combining both fusion and recognition, which can consider category-specific information. The joint representation learned by the alm-GRU is therefore more discriminative, leading to more accurate recognition results. We propose a novel AVSR network for fusion & recognition, an end-to-end DNN that considers modal and temporal information simultaneously.
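
As a usage note for the alm-GRU sketch given after the abstract, a hypothetical end-to-end training step might look like the following; it reuses the `AlmGRUSketch` and `alm_loss` definitions from that sketch, and the optimizer, learning rate, and batch shapes are assumptions.

```python
# Hypothetical end-to-end training step: fusion and recognition are
# optimized jointly through one combined loss. Optimizer choice, learning
# rate, and tensor shapes are assumptions, not the paper's settings.
import torch

model = AlmGRUSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

audio = torch.randn(8, 50, 40)       # (batch, frames, audio features)
visual = torch.randn(8, 50, 256)     # (batch, frames, visual features)
target = torch.randint(0, 10, (8,))  # class labels

opt.zero_grad()
logits, aux_a, aux_v = model(audio, visual)
loss = alm_loss(logits, aux_a, aux_v, target)
loss.backward()  # gradients flow through fusion and both modality GRUs
opt.step()
```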

RELATED WORK
DATA AUGMENTATION
EXPERIMENTS
CONCLUSION