Abstract

Visual speech recognition models traditionally consist of two stages: feature extraction and classification. Several deep learning approaches have recently been presented which aim to replace the feature extraction stage by automatically extracting features from raw mouth images. However, research on simultaneously learning features and performing classification remains limited. In addition, most existing methods require large amounts of data to achieve state-of-the-art performance and under-perform otherwise. In this work, an end-to-end lip-reading system for isolated word recognition, suitable for small-scale datasets, is presented based on fully connected layers and Long Short-Term Memory (LSTM) networks. The model consists of two streams: one which extracts features directly from the mouth images and one which extracts features from the difference images. A Bidirectional LSTM (BLSTM) models the temporal dynamics in each stream, and the two streams are then fused via another BLSTM. An absolute improvement in classification rate of 0.6%, 3.4%, 3.9% and 11.4% over the state-of-the-art is reported on the OuluVS2, CUAVE, AVLetters and AVLetters2 databases, respectively.
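
For concreteness, the sketch below shows one way the described two-stream architecture could be assembled in PyTorch: each stream applies fully connected layers to every frame and passes the resulting sequence to a BLSTM, and the two stream outputs are fused by a second BLSTM before word classification. The layer widths, the 44x26 mouth-region size and the class count are illustrative assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch of the two-stream lip-reading model described above.
# All sizes below are illustrative assumptions, not the paper's exact settings.
import torch
import torch.nn as nn

class Stream(nn.Module):
    """One stream: per-frame fully connected encoder followed by a BLSTM."""
    def __init__(self, frame_dim=44 * 26, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(frame_dim, 2000), nn.ReLU(),
            nn.Linear(2000, 1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.ReLU(),
        )
        self.blstm = nn.LSTM(500, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                 # x: (batch, time, frame_dim)
        b, t, d = x.shape
        feats = self.encoder(x.reshape(b * t, d)).reshape(b, t, -1)
        out, _ = self.blstm(feats)        # (batch, time, 2 * hidden)
        return out

class TwoStreamLipReader(nn.Module):
    """Raw-image stream + difference-image stream, fused by a second BLSTM."""
    def __init__(self, frame_dim=44 * 26, hidden=256, num_classes=10):
        super().__init__()
        self.raw_stream = Stream(frame_dim, hidden)
        self.diff_stream = Stream(frame_dim, hidden)
        self.fusion = nn.LSTM(4 * hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames):            # frames: (batch, time, frame_dim)
        # Difference images: frame-to-frame temporal differences,
        # zero-padded so both streams see sequences of equal length.
        diffs = frames[:, 1:] - frames[:, :-1]
        diffs = torch.cat([torch.zeros_like(frames[:, :1]), diffs], dim=1)
        fused_in = torch.cat([self.raw_stream(frames),
                              self.diff_stream(diffs)], dim=-1)
        fused, _ = self.fusion(fused_in)
        return self.classifier(fused[:, -1])   # one label per isolated word

# Example: a batch of 2 clips, 29 frames each, flattened 44x26 mouth ROIs.
logits = TwoStreamLipReader()(torch.randn(2, 29, 44 * 26))
```

Because every component is differentiable, feature extraction and classification are trained jointly end to end, which is the property the abstract emphasises.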
