Abstract

Studies of human-machine interfaces have demonstrated that visual information can enhance speech recognition accuracy, especially in noisy environments. Deep learning has been widely applied to this audio-visual speech recognition (AVSR) problem, owing to its remarkable achievements in both speech recognition and image recognition. Although existing deep learning models succeed in incorporating visual information into speech recognition, none of them simultaneously considers the sequential characteristics of both the audio and visual modalities. To overcome this deficiency, we propose a multimodal recurrent neural network (multimodal RNN) model that accounts for the sequential characteristics of both modalities in AVSR. In particular, the multimodal RNN comprises three components: an audio part, a visual part, and a fusion part. The audio and visual parts capture the sequential characteristics of the audio and visual modalities, respectively, while the fusion part combines the outputs of both. We model the audio modality with an LSTM RNN, model the visual modality with a convolutional neural network (CNN) followed by an LSTM RNN, and combine the two through a multimodal layer in the fusion part. We validate the proposed multimodal RNN model on the multi-speaker AVSR benchmark dataset AVletters. The experimental results show performance improvements over the best previously reported audio-visual recognition accuracies on AVletters and confirm the robustness of our multimodal RNN model.
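To make the three-part architecture concrete, the following is a minimal PyTorch sketch of the structure described above: an LSTM over acoustic features, a per-frame CNN followed by an LSTM over lip-region frames, and a fusion layer over the two modalities. All layer sizes, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of the multimodal RNN (audio part, visual part, fusion part).
# Hyperparameters and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn

class MultimodalRNN(nn.Module):
    def __init__(self, audio_dim=26, hidden_dim=128, num_classes=26):
        super().__init__()
        # Audio part: LSTM over per-frame acoustic features (e.g., MFCCs).
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        # Visual part: a small CNN applied to each lip-region frame,
        # followed by an LSTM over the resulting frame embeddings.
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, hidden_dim),
        )
        self.visual_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Fusion part: a multimodal layer over the two final hidden states.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio, frames):
        # audio:  (batch, T_audio, audio_dim)
        # frames: (batch, T_video, 1, H, W) grayscale lip-region frames
        _, (h_audio, _) = self.audio_lstm(audio)
        b, t, c, h, w = frames.shape
        frame_feats = self.frame_cnn(frames.view(b * t, c, h, w)).view(b, t, -1)
        _, (h_visual, _) = self.visual_lstm(frame_feats)
        fused = torch.cat([h_audio[-1], h_visual[-1]], dim=-1)
        return self.fusion(fused)  # per-utterance letter logits
```

For AVletters-style isolated-letter recognition, the logits would be trained with a standard cross-entropy loss over the 26 letter classes, with one prediction per utterance.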
