Abstract

In this paper, WaveNet with cross-attention is proposed for Audio-Visual Automatic Speech Recognition (AV-ASR) to address the problems of multimodal feature fusion and frame alignment between the two data streams. WaveNet is usually used for speech generation and speech recognition; in this paper, we extend it to audio-visual speech recognition and introduce a cross-attention mechanism at different places in WaveNet for feature fusion. The proposed cross-attention mechanism explores which visual feature frames are correlated with each acoustic feature frame. The experimental results show that WaveNet with cross-attention reduces the Tibetan single-syllable error by about 4.5% and the English word error by about 39.8% relative to audio-only speech recognition, and reduces the Tibetan single-syllable error by about 35.1% and the English word error by about 21.6% relative to the conventional feature-concatenation method for AV-ASR.
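As a concrete illustration of this idea, the following is a minimal NumPy sketch of windowed cross-attention fusion, assuming simple scaled dot-product scoring between each audio frame and the visual frames in a small window around its mapped position; the feature dimensions, window size, and fusion by concatenation are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(audio, visual, window=3):
    """Fuse audio and visual streams with windowed cross-attention.

    audio:  (Ta, d) acoustic feature frames (queries)
    visual: (Tv, d) visual feature frames (keys/values)
    window: number of visual frames considered on each side of the
            visual position mapped from the current audio frame
            (analogous to the frame shift l in the paper).
    """
    Ta, d = audio.shape
    Tv = visual.shape[0]
    fused = np.zeros((Ta, 2 * d))
    for t in range(Ta):
        # Map the audio frame index onto the (slower) visual frame rate.
        c = int(round(t * (Tv - 1) / max(Ta - 1, 1)))
        lo, hi = max(0, c - window), min(Tv, c + window + 1)
        keys = visual[lo:hi]                      # candidate visual frames
        scores = keys @ audio[t] / np.sqrt(d)     # dot-product attention scores
        weights = softmax(scores)                 # larger weight = better match
        context = weights @ keys                  # attended visual context
        fused[t] = np.concatenate([audio[t], context])
    return fused

# Example: 100 audio frames (100 fps) aligned against 25 video frames (25 fps).
audio = np.random.randn(100, 64)
visual = np.random.randn(25, 64)
print(cross_attention_fusion(audio, visual).shape)  # (100, 128)
```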

Highlights

  • Man-machine interaction interfaces are needed for all kinds of devices in daily life

  • For the proposed models with cross-attention in the input layer, the error rates are reduced by 4.5% relative in AV-WaveNet-Connectionist Temporal Classification (CTC)-A-I-7 (l = 7) for Tibetan and by 39.8% relative in AV-WaveNet-CTC-A-I-3 (l = 3) for English, compared with A-WaveNet-CTC

  • These results show that the cross-attention mechanism introduced in the input layer of WaveNet-CTC can improve model performance for audio-visual speech recognition


Summary

INTRODUCTION

A man-machine interaction interface is necessary for all kinds of devices. WaveNet is composed of dilated causal convolutional layers, which enlarge the receptive field by skipping input values with a certain step, making it powerful for modelling long-term dependencies in speech data. To capture an effective fused feature and address the alignment of two data streams with different frame rates, we introduce the cross-attention mechanism into WaveNet and combine it with the Connectionist Temporal Classification (CTC) loss. The cross-attention mechanism is placed at the input layer or a hidden layer of WaveNet to automatically learn the weights of the visual frames near the current audio frame. Visual feature frames with large scores provide more effective information for the acoustic features and match the current audio frame more closely. Owing to the large receptive field of the WaveNet model in its higher layers, it is more difficult to align visual hidden feature frames with acoustic hidden feature frames. Our work makes three contributions: (i) we introduce the cross-attention mechanism to align the data of the two modalities in feature space; (ii) we explore the effects of the cross-attention mechanism for early fusion and middle fusion in WaveNet; and (iii) we explore the video frame shift for the cross-attention calculation to improve speech recognition performance and computation speed.
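To make the receptive-field argument concrete, here is a minimal NumPy sketch of a stack of dilated causal convolutions, assuming kernel size 2 and dilations that double per layer as in the original WaveNet; the random filters are placeholders, since the point is only how quickly the receptive field grows with depth.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D causal convolution, kernel size 2: y[t] = w0*x[t-dilation] + w1*x[t].

    Left-padding by `dilation` zeros keeps the output causal
    (no dependence on future frames) and the same length as x.
    """
    pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * pad[:-dilation] + w[1] * pad[dilation:]

# Stack layers with dilations 1, 2, 4, 8, ... as in WaveNet.
rng = np.random.default_rng(0)
x = rng.standard_normal(32)
receptive_field = 1
for dilation in (1, 2, 4, 8):
    x = np.tanh(dilated_causal_conv(x, rng.standard_normal(2), dilation))
    receptive_field += dilation  # each kernel-size-2 layer adds `dilation` past frames
print(receptive_field)  # 16 frames: doubling dilations enlarge the field exponentially
```

This exponential growth is exactly why alignment is harder in the higher layers: a hidden acoustic frame there already summarizes many input frames, so matching it against individual visual frames becomes less precise than attention at the input layer.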

