Abstract

Visual information plays a key role in automatic speech recognition (ASR) when the audio is corrupted by background noise or even inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face, with limited or no access to the sound of the voice. Based on the conducted experimental evaluations, as well as on an analysis of the research field, we propose a novel task-oriented approach to the implementation of practical lip-reading systems. Its main purpose is to serve as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. As a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem: first, recognizing isolated words, numbers, or short phrases (e.g., telephone numbers with a strict grammar, or keywords); second, recognizing continuous speech (phrases or sentences). All these stages are disclosed in detail in this paper. Based on the proposed approach, we implemented from scratch automatic visual speech recognition systems of three different architectures: GMM-CHMM, DNN-HMM, and purely end-to-end. The methodology, tools, step-by-step development process, and all necessary parameters are described in detail in the current paper. It is worth noting that such systems have been created for Russian speech recognition for the first time.
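To make the end-to-end direction concrete, below is a minimal sketch of a purely end-to-end lip-reading model in PyTorch: a 3D-CNN front-end over mouth-region crops, a bidirectional GRU, and CTC training. This illustrates the general architecture family only; the layer sizes, the 88x88 crop resolution, the 75-frame clip length, and the vocabulary size are assumptions chosen for the example, not the parameters used in the paper.

```python
# Illustrative end-to-end lip-reading sketch (3D-CNN + BiGRU + CTC).
# All sizes here are example assumptions, not the paper's settings.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_classes=40):  # e.g. viseme/phoneme units + CTC blank
        super().__init__()
        # Spatio-temporal front-end over grayscale mouth crops
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5),
                      stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.rnn = nn.GRU(input_size=32 * 22 * 22, hidden_size=256,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, num_classes)

    def forward(self, x):
        # x: (batch, 1, time, 88, 88) grayscale mouth-region crops
        feats = self.frontend(x)            # -> (B, 32, T, 22, 22)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.rnn(feats)            # -> (B, T, 512)
        return self.classifier(out)         # per-frame class logits

# One CTC training step on dummy data
model = LipReader()
frames = torch.randn(2, 1, 75, 88, 88)     # two 75-frame clips
log_probs = model(frames).log_softmax(dim=-1).permute(1, 0, 2)  # (T, B, C)
targets = torch.randint(1, 40, (2, 20))    # label sequences (0 = blank)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 75),
                           target_lengths=torch.full((2,), 20))
loss.backward()
```

A design note on this family of models: CTC lets the network be trained on whole utterances without frame-level alignments, which is what makes the approach "purely end-to-end" in contrast to the GMM-CHMM and DNN-HMM pipelines, where an HMM provides the temporal alignment.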

Highlights

  • The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face, with limited or no access to the sound of the voice

  • Visual information plays a key role in automatic speech recognition (ASR) when the audio is corrupted by background noise or even inaccessible

  • In this paper we present the task-oriented approach we developed for creating practical visual speech recognition systems


Summary

INTRODUCTION

The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face, with limited or no access to the sound of the voice. In quiet office environments, speech recognition can approach almost one hundred percent accuracy for a variety of tasks; this is often achieved under the condition of a limited vocabulary and a strict grammar. Since the early 1990s, there have been several attempts to use visual information about speech in addition to acoustic information to improve the accuracy and reliability of automatic recognition systems. In a number of studies, the developed audio-visual speech recognition systems have demonstrated better performance than their audio-only counterparts. There is little research on the effect of acoustically noisy environments on the performance of visual speech recognition systems, and few studies have focused on inflectional languages (such as Russian). There is a huge difference between the recognition of analytical languages (for example, English) and inflected languages, due to the presence in the latter of a much larger number of word forms and grammatical rules.

BACKGROUND AND RELATED RESEARCH
DATA COLLECTION AND ANALYSIS
PROPOSED TASK-ORIENTED APPROACH
Findings
CONCLUSIONS