Abstract

It is widely accepted that the visual modality of speech provides complementary information to the speech recognition task, and many models have been introduced to make good use of the visual channel. This article develops AV Taris, a fully differentiable neural network model capable of decoding audio-visual speech in real time. We achieve this by connecting two of our previously proposed models, AV Align and Taris, which are end-to-end differentiable approaches to audio-visual speech integration and online speech recognition, respectively. We evaluate AV Taris under the same conditions as AV Align and Taris on one of the largest publicly available audio-visual speech datasets, LRS2. Our results show that AV Taris is superior to the audio-only variant of Taris, demonstrating the utility of the visual modality for speech recognition within the real-time decoding framework defined by Taris. Compared to an equivalent Transformer-based AV Align model that takes advantage of full sentences without meeting the real-time requirement, we report an absolute degradation of approximately 3% with AV Taris. Unlike the more popular alternative for online speech recognition, the RNN Transducer, Taris offers a greatly simplified, fully differentiable training pipeline. We speculate that AV Taris has the potential to popularise the adoption of Audio-Visual Speech Recognition (AVSR) technology and overcome the inherent limitations of the audio modality in less optimal listening conditions. Our code is publicly available at https://github.com/georgesterpu/Taris.
