Survey on automatic lip-reading in the era of deep learning

Adriana Fernandez-Lopez, Federico M Sukno

doi:10.1016/j.imavis.2018.07.002

Open Access

https://doi.org/10.1016/j.imavis.2018.07.002

Journal: Image and Vision Computing
Publication Date: Jul 30, 2018
Citations: 83
License type: other-oa

Affiliation: Pompeu Fabra University

Abstract

In the last few years, there has been increasing interest in developing systems for Automatic Lip-Reading (ALR). As in other computer vision applications, methods based on Deep Learning (DL) have become very popular and have substantially advanced the achievable performance. In this survey, we review ALR research during the last decade, highlighting the progression from approaches that predate DL (which we refer to as traditional) toward end-to-end DL architectures. We provide a comprehensive list of the audio-visual databases available for lip-reading, describing the tasks they can be used for, their popularity and their most important characteristics, such as the number of speakers, vocabulary size, recording settings and total duration. In line with the shift toward DL, we show that there is a clear tendency toward large-scale datasets targeting realistic application settings and large numbers of samples per class. We also summarize, discuss and compare the different ALR systems proposed in the last decade, considering traditional and DL approaches separately. We carry out a quantitative analysis of the different systems by organizing them in terms of the task that they target (e.g. recognition of letters or digits and words or sentences) and comparing their reported performance on the most commonly used datasets. As a result, we find that DL architectures perform similarly to traditional ones for simpler tasks but report significant improvements in more complex tasks, such as word or sentence recognition, with up to 40% improvement in word recognition rates. We therefore provide a detailed description of the available ALR systems based on end-to-end DL architectures and identify a tendency to focus on the modeling of temporal context as the key to advancing the field. Such modeling is dominated by recurrent neural networks due to their ability to retain context at multiple scales (e.g. short- and long-term information). In this sense, current efforts tend toward techniques that allow more comprehensive modeling and interpretability of the retained context.

Similar Papers

Lip-Reading with Limited-Data Network
Adriana Fernandez-Lopez ... Federico M Sukno
01 Sep 2019

A Systematic Study and Empirical Analysis of Lip Reading Models using Traditional and Deep Learning Algorithms
R Sangeetha ... D Malathi
JOURNAL OF ADVANCED APPLIED SCIENTIFIC RESEARCH | VOL. 4
07 Apr 2022

Deep Learning in Neural Networks: An Overview
Vidit Shukla ... Shilpa Choudhary
15 Aug 2022

Content Fidelity of Deep Learning Methods for Clipping and Over-exposure Correction
Mekides Assefa Abebe
London Imaging Meeting | VOL. 2
20 Sep 2021
