LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES

Fatemeh Vakhshiteh, Ahmad Nickabadi, Farshad Almasganj

doi:10.5566/ias.1859

Open Access

https://doi.org/10.5566/ias.1859

Journal: Image Analysis & Stereology
Publication Date: Jul 9, 2018
Citations: 6
License type: CC BY-NC 4.0

Affiliation: Amirkabir University of Technology

Abstract

Lip-reading is the visual interpretation of a speaker's lip movements during speech. Experiments over many years have shown that speech intelligibility increases when visual facial information is available, and that this effect is more pronounced in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual unit type, the diversity of features, and their inter-speaker dependency. Although efforts have been made to overcome these challenges, a flawless lip-reading system is still under investigation. This paper searches for a lip-reading model in which the processing blocks are combined and arranged efficiently to extract highly discriminative visual features. In particular, the application of a properly structured Deep Belief Network (DBN)-based recognizer is highlighted. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, achieving phone recognition rates (PRRs) of 77.65% and 73.40%, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition works.

Full Text