Abstract

Since the attention mechanism was introduced for neural machine translation, attention has either been combined with the long short-term memory (LSTM) or replaced the LSTM in the transformer model to overcome the sequence-to-sequence (seq2seq) limitations of the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) can gain additional performance by learning the correlation between the audio and visual modalities. Because the audio signal carries richer information than the video of the lips, however, it is difficult to train attention in AVSR with the two modalities balanced. To raise the role of the visual modality to the level of the audio modality by fully exploiting the input information when learning attention, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector computed with the video query and a video context vector computed with the audio query. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to enforce the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved at least a 7.3% relative improvement on average in word error rate (WER) over competing methods based on the transformer model.
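As a rough illustration of the DCM idea, the following Python (PyTorch) sketch shows one way to form the two context vectors, with the video sequence querying the audio sequence and vice versa; the module names, dimensions, and the multi-head attention realization are illustrative assumptions, not the authors' exact implementation.

# Minimal sketch of dual cross-modality (DCM) attention.
# Assumes audio/video encoder outputs of shape (batch, time, d_model);
# names, sizes, and the attention realization are illustrative assumptions.
import torch
import torch.nn as nn

class DualCrossModalityAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Audio context: video frames act as queries over audio keys/values.
        self.audio_ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Video context: audio frames act as queries over video keys/values.
        self.video_ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        audio_context, _ = self.audio_ctx(query=video, key=audio, value=audio)
        video_context, _ = self.video_ctx(query=audio, key=video, value=video)
        return audio_context, video_context

audio = torch.randn(2, 120, 256)   # e.g., acoustic encoder output for 2 utterances
video = torch.randn(2, 75, 256)    # e.g., lip-region encoder output
a_ctx, v_ctx = DualCrossModalityAttention()(audio, video)
print(a_ctx.shape, v_ctx.shape)    # torch.Size([2, 75, 256]) torch.Size([2, 120, 256])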

Highlights

  • Automatic speech recognition (ASR) has attracted much interest because speech is the most convenient, natural, and user-friendly interface to various kinds of devices

  • Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model with the dual cross-modality (DCM) attention scheme and the hybrid CTC/attention architecture achieved at least a 7.3% relative improvement on average in word error rate (WER) over competing methods based on the transformer model

  • We introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force monotonic alignments, which results in a hybrid CTC/attention architecture to improve the performance of audio–visual speech recognition (AVSR) [19]
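One common way to realize such a hybrid objective is a weighted sum of the CTC loss from the encoder and the cross-entropy loss from the attention decoder; the PyTorch sketch below is only illustrative, and the interpolation weight, padding conventions, and tensor shapes are assumptions rather than the paper's exact settings.

# Illustrative hybrid CTC/attention objective (not the authors' exact code).
# Assumptions: blank and padding share index 0; shapes as noted below.
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lengths,
                              target_lengths, dec_logits, dec_targets, lam=0.3):
    # CTC branch: enforces a monotonic alignment between input frames and labels.
    # ctc_log_probs: (T, batch, vocab) log-probabilities from the encoder head.
    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths,
                     target_lengths, blank=0, zero_infinity=True)
    # Attention branch: per-token cross-entropy of the decoder predictions.
    # dec_logits: (batch, L, vocab); dec_targets: (batch, L) padded with 0.
    att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets, ignore_index=0)
    # Weighted interpolation of the two objectives.
    return lam * ctc + (1.0 - lam) * att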


Summary

Introduction

Automatic speech recognition (ASR) has attracted much interest because speech is the most convenient, natural, and user-friendly interface to various kinds of devices. However, a speech signal acquired in real-world noisy environments is significantly contaminated, and the performance of ASR systems on such contaminated speech is seriously degraded by the mismatch between the training and testing environments. Although many approaches have been developed to achieve robustness by compensating for this mismatch under specific conditions, most of them fail to attain robustness in real-world environments with various types of noise (e.g., [1,2,3,4,5]). Robust recognition therefore remains a challenging but important issue in the field of ASR. Because visual information is not distorted by acoustic noise, visual speech recognition (known as lip reading) may play an important role in ASR in acoustically adverse environments [6].

