An optimal 3D convolutional neural network based lipreading method

Lun He, Biyun Ding, Tao Zhang, Hao Wang

doi:10.1049/ipr2.12337

Abstract

Lipreading is the visual recognition of speech from lip movement; it aims to recognise phrases and sentences spoken by a talking face without the audio. However, existing lipreading models suffer from slow training and insufficient performance. To accelerate training, a batch group training algorithm is proposed, which groups all the data with different frame counts. In addition, a 3D-MouthNet-BLSTM-CTC architecture is proposed to improve model performance. It is based on a 3D convolutional neural network, MouthNet, two Bi-LSTMs, and a CTC objective function. Experimental results on Oulu-VS2 and a self-built dataset show that an accuracy of 96.2% is achieved on the Oulu-VS2 dataset, and an accuracy of 93.8% is achieved on the GRID dataset. This article studies lipreading with deep learning methods. It proposes a new network architecture and tests it on public datasets, aiming to achieve the best results.
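
The abstract does not spell out how the batch group training algorithm forms its groups. As a rough illustration only, the sketch below assumes that clips are grouped by frame count, so that every mini-batch holds clips of a single length and needs no padding. The function name `batch_groups` and its parameters are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
import random


def batch_groups(frame_counts, batch_size, seed=0):
    """Return a list of mini-batches, each a list of sample indices.

    Assumption (not from the paper): samples are grouped by their
    number of frames, so every batch contains clips of one length.
    """
    rng = random.Random(seed)

    # Bucket sample indices by frame count.
    groups = defaultdict(list)
    for idx, n_frames in enumerate(frame_counts):
        groups[n_frames].append(idx)

    # Split each bucket into batches of at most `batch_size` samples.
    batches = []
    for indices in groups.values():
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    # Shuffle the batch order so lengths are mixed across an epoch.
    rng.shuffle(batches)
    return batches


# Hypothetical frame counts for five clips.
frame_counts = [75, 75, 30, 75, 30]
batches = batch_groups(frame_counts, batch_size=2)
```

Under this assumption each batch can be stacked into one dense tensor, which is one plausible way such grouping could reduce wasted computation on padded frames.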

Full Text