Abstract
Audiovisual speech recognition is a promising approach to multimodal human–computer interaction. For a long time, it has been difficult to develop machines capable of generating or understanding even fragments of natural language; the fusion of sight, smell, touch, and other senses provides machines with a possible medium for perception and understanding. This article presents a detailed review of recent advances in the audiovisual speech recognition area. After outlining the development of audiovisual speech recognition along a timeline, we describe typical audiovisual speech databases, both single-view and multi-view, since publicly available general-purpose databases should be the first concern for audiovisual speech recognition tasks. For the subsequent challenges, which are inseparable from feature extraction and dynamic audiovisual fusion, the principal usefulness of deep learning-based tools, such as deep fully convolutional neural networks, bidirectional long short-term memory networks, and 3D convolutional neural networks, lies in the fact that they offer relatively simple solutions to these problems. Having analyzed and compared well-developed audiovisual speech recognition frameworks in terms of computational load, accuracy, and applicability, we further present our insights into future audiovisual speech recognition architecture design. We argue that end-to-end audiovisual speech recognition models and deep learning-based feature extractors will lead multimodal human–computer interaction directly to a solution.
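To make the end-to-end architectures named above concrete, the following is a minimal sketch in PyTorch, not a model from the reviewed literature: it combines a 3D convolutional visual front-end, a simple convolutional audio front-end over MFCC frames, feature-level fusion, and a bidirectional LSTM. All layer sizes, the 13-dimensional MFCC input, and the class count are illustrative assumptions.

```python
# Minimal AVSR sketch (assumed layer sizes; not the reviewed papers' exact models).
import torch
import torch.nn as nn


class AVSRSketch(nn.Module):
    def __init__(self, num_classes=28, vis_dim=64, aud_dim=64, hidden=128):
        super().__init__()
        # Visual front-end: 3D convolution over (time, height, width) of lip ROIs.
        self.visual = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),   # keep the time axis, pool space
        )
        self.visual_proj = nn.Linear(32 * 4 * 4, vis_dim)
        # Audio front-end: 1D convolution over per-frame acoustic features (e.g. MFCCs).
        self.audio = nn.Sequential(
            nn.Conv1d(13, aud_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Feature-level (early) fusion followed by a bidirectional LSTM.
        self.blstm = nn.LSTM(vis_dim + aud_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, video, audio):
        # video: (B, 1, T, H, W) grayscale lip ROIs; audio: (B, 13, T) MFCC frames.
        v = self.visual(video)                       # (B, 32, T, 4, 4)
        b, c, t, h, w = v.shape
        v = v.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        v = self.visual_proj(v)                      # (B, T, vis_dim)
        a = self.audio(audio).transpose(1, 2)        # (B, T, aud_dim)
        fused = torch.cat([v, a], dim=-1)            # concatenate modalities per frame
        out, _ = self.blstm(fused)
        return self.classifier(out)                  # per-frame class scores


if __name__ == "__main__":
    model = AVSRSketch()
    video = torch.randn(2, 1, 75, 64, 64)   # 75 frames of 64x64 lip crops
    audio = torch.randn(2, 13, 75)           # 13-dim MFCCs aligned to the frames
    print(model(video, audio).shape)          # torch.Size([2, 75, 28])
```

The concatenation before the recurrent layer corresponds to feature-level fusion; decision-level fusion would instead run separate sequence models per modality and combine their outputs.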
Highlights
When it comes to human–computer interaction (HCI), human–computer harmony is as enduring a pursuit as human–computer collaboration
By mathematically expressing the audiovisual speech recognition (AVSR) process, we summarize the traditional and deep learning-based tools employed in AVSR systems
Considering the simultaneous maximization of speech-related features and suppression of extraneous features, solutions for visual feature extraction fall into two categories; the first matches a statistical model of lip shape and appearance, for instance, the active appearance model (AAM) used to acquire visual features from image sequences of speakers[19,20]
Summary
When it comes to human–computer interaction (HCI), human–computer harmony is as enduring a pursuit as human–computer collaboration. Considering the simultaneous maximization of speech-related features and suppression of extraneous features, solutions for visual feature extraction fall into two categories. The first matches a statistical model of lip shape and appearance; for instance, the active appearance model (AAM) is used to acquire visual features from image sequences of speakers.[19,20] The second directly extracts lip features from pixels without regard to a priori lip models. The former is termed model-based feature extraction.[21] In AAM, high-level features of lip shape and appearance constitute the source from which a training supervisor derives accurate landmark coordinates during a quantitative training phase. One of the reviewed audiovisual speech databases consists of Chinese and English poems, 30 tongue twisters, digits, the Greek alphabet, and music. A sketch of the pixel-based category follows below.
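As a concrete illustration of the second, pixel-based category, the following is a minimal sketch in Python/OpenCV, an illustrative assumption rather than the exact pipeline of the cited works: features are taken directly from a mouth region of interest (ROI) without a prior lip shape model, here via a 2D discrete cosine transform with only the low-frequency coefficients kept. The ROI coordinates, image size, and coefficient count are assumptions.

```python
# Pixel-based (appearance-based) lip feature sketch; ROI and parameters are illustrative.
import cv2
import numpy as np


def pixel_based_lip_features(frame, roi, size=(32, 32), n_coeffs=6):
    """frame: BGR image; roi: (x, y, w, h) mouth bounding box, assumed to come
    from an upstream face/landmark detector; returns a flat feature vector."""
    x, y, w, h = roi
    mouth = frame[y:y + h, x:x + w]                       # crop the mouth region
    gray = cv2.cvtColor(mouth, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size).astype(np.float32) / 255.0
    coeffs = cv2.dct(gray)                                # 2D discrete cosine transform
    return coeffs[:n_coeffs, :n_coeffs].ravel()           # keep low-frequency coefficients


if __name__ == "__main__":
    dummy = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)
    feats = pixel_based_lip_features(dummy, roi=(120, 140, 80, 50))
    print(feats.shape)  # (36,)
```

A model-based pipeline would instead fit AAM shape and appearance parameters per frame and use those parameters, rather than raw pixel transforms, as the visual feature vector.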