Abstract

Audiovisual Speech Recognition (AVSR), one of the most common applications of multimodal learning, combines video and audio information to achieve robust automatic speech recognition. Traditionally, AVSR was treated as an inference and projection problem, which placed substantial restrictions on its capabilities. With deeper study, deep neural networks (DNNs) have become an important part of the toolkit for classification tasks such as automatic speech recognition, image classification, and natural language processing. AVSR systems often use DNN models such as Multimodal Deep Autoencoders (MDAEs), the Multimodal Deep Belief Network (MDBN), and the Multimodal Deep Boltzmann Machine (MDBM), which generally outperform traditional methods. However, these DNN models have several shortcomings: first, they cannot balance modal fusion and temporal fusion, and some lack temporal fusion entirely; second, their architectures are not end-to-end, which makes training and testing cumbersome. We designed a DNN model, the Aggregate[d] Mult[i]moda[l] Bidirection[a]l Recurren[t] Mod[e]l (DILATE), to overcome these weaknesses. DILATE can not only be trained and tested simultaneously, but is also easy to train and helps prevent overfitting automatically. Experiments show that DILATE outperforms traditional methods and other DNN models on several benchmark datasets.
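
To make the distinction between modal fusion and temporal fusion concrete, the following is a minimal sketch, in PyTorch, of a multimodal bidirectional recurrent classifier that can be trained end-to-end. This is an illustration under assumed settings, not the paper's DILATE architecture: the feature dimensions, the concatenation-based fusion, and the mean-pooled classifier head are all assumptions for the example.

# Hypothetical sketch of modal + temporal fusion with a bidirectional RNN.
# Not the DILATE model; dimensions and fusion strategy are assumptions.
import torch
import torch.nn as nn


class MultimodalBiRNN(nn.Module):
    def __init__(self, audio_dim=40, video_dim=112, hidden_dim=128, num_classes=10):
        super().__init__()
        # Per-modality encoders project each stream to a shared size.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # Modal fusion: concatenate the projected streams frame by frame.
        # Temporal fusion: a bidirectional LSTM over the fused sequence.
        self.birnn = nn.LSTM(input_size=2 * hidden_dim,
                             hidden_size=hidden_dim,
                             batch_first=True,
                             bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim); video: (batch, time, video_dim)
        fused = torch.cat([self.audio_proj(audio), self.video_proj(video)], dim=-1)
        outputs, _ = self.birnn(fused)                # (batch, time, 2 * hidden_dim)
        return self.classifier(outputs.mean(dim=1))   # utterance-level logits


if __name__ == "__main__":
    model = MultimodalBiRNN()
    audio = torch.randn(4, 75, 40)    # e.g. 75 frames of acoustic features
    video = torch.randn(4, 75, 112)   # e.g. 75 frames of lip-region features
    logits = model(audio, video)
    print(logits.shape)               # torch.Size([4, 10])

Because the fusion, the recurrent layer, and the classifier are a single differentiable graph, the whole model can be optimized end-to-end with a standard loss (for example cross-entropy), which is the property the abstract contrasts with the separately trained MDAE/MDBN/MDBM pipelines.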
