Abstract
Audiovisual speech recognition is a promising approach to multimodal human–computer interaction. For a long time, it has been difficult to develop machines capable of generating or understanding even fragments of natural language; the fusion of sight, smell, touch, and other senses provides machines with a possible medium for perception and understanding. This article presents a detailed review of recent advances in the audiovisual speech recognition area. After outlining the development of audiovisual speech recognition along a timeline, we describe typical audiovisual speech databases, both single-view and multi-view, since publicly available general-purpose databases should be the first concern for audiovisual speech recognition tasks. For the subsequent challenges, which are inseparable from feature extraction and dynamic audiovisual fusion, the principal usefulness of deep learning-based tools, such as deep fully convolutional neural networks, bidirectional long short-term memory networks, and 3D convolutional neural networks, lies in the fact that they offer relatively simple solutions to these problems. Having analyzed and compared well-developed audiovisual speech recognition frameworks in terms of computational load, accuracy, and applicability, we further present our insights into future audiovisual speech recognition architecture design. We argue that end-to-end audiovisual speech recognition models and deep learning-based feature extractors will lead multimodal human–computer interaction directly to a solution.
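To make the end-to-end architectures named above concrete, the following is a minimal sketch in PyTorch, not a model from the reviewed literature: it combines a 3D convolutional visual front-end, a simple convolutional audio front-end over MFCC frames, feature-level fusion, and a bidirectional LSTM. All layer sizes, the 13-dimensional MFCC input, and the class count are illustrative assumptions.

```python
# Minimal AVSR sketch (assumed layer sizes; not the reviewed papers' exact models).
import torch
import torch.nn as nn


class AVSRSketch(nn.Module):
    def __init__(self, num_classes=28, vis_dim=64, aud_dim=64, hidden=128):
        super().__init__()
        # Visual front-end: 3D convolution over (time, height, width) of lip ROIs.
        self.visual = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),   # keep the time axis, pool space
        )
        self.visual_proj = nn.Linear(32 * 4 * 4, vis_dim)
        # Audio front-end: 1D convolution over per-frame acoustic features (e.g. MFCCs).
        self.audio = nn.Sequential(
            nn.Conv1d(13, aud_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Feature-level (early) fusion followed by a bidirectional LSTM.
        self.blstm = nn.LSTM(vis_dim + aud_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, video, audio):
        # video: (B, 1, T, H, W) grayscale lip ROIs; audio: (B, 13, T) MFCC frames.
        v = self.visual(video)                       # (B, 32, T, 4, 4)
        b, c, t, h, w = v.shape
        v = v.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        v = self.visual_proj(v)                      # (B, T, vis_dim)
        a = self.audio(audio).transpose(1, 2)        # (B, T, aud_dim)
        fused = torch.cat([v, a], dim=-1)            # concatenate modalities per frame
        out, _ = self.blstm(fused)
        return self.classifier(out)                  # per-frame class scores


if __name__ == "__main__":
    model = AVSRSketch()
    video = torch.randn(2, 1, 75, 64, 64)   # 75 frames of 64x64 lip crops
    audio = torch.randn(2, 13, 75)           # 13-dim MFCCs aligned to the frames
    print(model(video, audio).shape)          # torch.Size([2, 75, 28])
```

The concatenation before the recurrent layer corresponds to feature-level fusion; decision-level fusion would instead run separate sequence models per modality and combine their outputs.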
Highlights
When it comes to human–computer interaction (HCI), human–computer harmony is as enduring a pursuit as human–computer collaboration
By mathematically expressing the audiovisual speech recognition (AVSR) process, we summarize the traditional and deep learning-based tools employed in AVSR systems
Considering the simultaneous maximization of speech-related features and suppression of extraneous features, solutions for visual feature extraction fall into two categories; the first matches a statistical model of lip shape and appearance, for instance, the active appearance model (AAM) used to acquire visual features from image sequences of speakers[19,20]
Summary
When it comes to human–computer interaction (HCI), human–computer harmony is as enduring a pursuit as human–computer collaboration. Considering the simultaneous maximization of speech-related features and suppression of extraneous features, solutions for visual feature extraction fall into two categories. The first matches a statistical model of lip shape and appearance; for instance, the active appearance model (AAM) is used to acquire visual features from image sequences of speakers.[19,20] The second directly extracts lip features from pixels without regard to a priori lip models. The former is termed model-based feature extraction.[21] In AAM, high-level features of lip shape and appearance constitute the source from which a training supervisor derives accurate landmark coordinates during a quantitative training phase. One of the reviewed audiovisual speech databases consists of Chinese and English poems, 30 tongue twisters, digits, the Greek alphabet, and music. A sketch of the pixel-based category follows below.
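As a concrete illustration of the second, pixel-based category, the following is a minimal sketch in Python/OpenCV, an illustrative assumption rather than the exact pipeline of the cited works: features are taken directly from a mouth region of interest (ROI) without a prior lip shape model, here via a 2D discrete cosine transform with only the low-frequency coefficients kept. The ROI coordinates, image size, and coefficient count are assumptions.

```python
# Pixel-based (appearance-based) lip feature sketch; ROI and parameters are illustrative.
import cv2
import numpy as np


def pixel_based_lip_features(frame, roi, size=(32, 32), n_coeffs=6):
    """frame: BGR image; roi: (x, y, w, h) mouth bounding box, assumed to come
    from an upstream face/landmark detector; returns a flat feature vector."""
    x, y, w, h = roi
    mouth = frame[y:y + h, x:x + w]                       # crop the mouth region
    gray = cv2.cvtColor(mouth, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size).astype(np.float32) / 255.0
    coeffs = cv2.dct(gray)                                # 2D discrete cosine transform
    return coeffs[:n_coeffs, :n_coeffs].ravel()           # keep low-frequency coefficients


if __name__ == "__main__":
    dummy = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)
    feats = pixel_based_lip_features(dummy, roi=(120, 140, 80, 50))
    print(feats.shape)  # (36,)
```

A model-based pipeline would instead fit AAM shape and appearance parameters per frame and use those parameters, rather than raw pixel transforms, as the visual feature vector.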