Abstract

Visual Speech Recognition aims at transcribing lip movements into readable text. There have been many strides in automatic speech recognition systems that recognize words from audio and visual speech features, even under noisy conditions. This paper focuses only on the visual features, whereas a robust system would use visual features to support acoustic features. We propose the concatenation of visemes (lip movements) for text classification rather than classic individual viseme mapping. The results show that this approach achieves a significant improvement over state-of-the-art models. The system has two modules: the first extracts lip features from the input video, while the second is a neural network trained to process the viseme sequence and classify it as text.
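
As a concrete illustration of the second module, the sketch below classifies a fixed-length viseme sequence with a small 1-D convolutional network in Keras. The viseme alphabet size, word vocabulary, and layer sizes are assumptions chosen for illustration, not the architecture reported in the paper.

    # Hypothetical sketch of the second module: map a fixed-length
    # sequence of viseme IDs to a word label. All sizes are assumed.
    import tensorflow as tf

    NUM_VISEMES = 14   # assumed size of the viseme alphabet
    SEQ_LEN = 15       # frames per word (see Highlights)
    NUM_WORDS = 10     # assumed size of the word vocabulary

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN,)),
        tf.keras.layers.Embedding(NUM_VISEMES, 16),
        tf.keras.layers.Conv1D(32, 3, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(NUM_WORDS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])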

Highlights

  • Visual Speech Recognition (VSR) is the process of extracting textual or speech data from facial features through image processing techniques

  • The visual features are extracted in the following pipeline: they are mean-normalized on a per-speaker basis, decorrelated and reduced to 40 dimensions using Linear Discriminant Analysis (LDA) and a Maximum Likelihood Linear Transform (MLLT), and Speaker Adaptive Training (SAT) is applied to normalize the variation in the acoustic features of different speakers (see the first sketch after this list)

  • Since the number of frames differs for each word due to variation in utterance duration, we fixed the number of frames to 15 and padded sequences shorter than 15 frames with a closed-mouth viseme (see the second sketch after this list)
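
The per-speaker mean normalization and the LDA reduction to 40 dimensions from the first highlight can be pictured as follows. MLLT and SAT are usually applied with a speech toolkit such as Kaldi and are not reproduced here, and the toy data shapes are assumptions.

    # Minimal sketch: per-speaker mean normalization followed by LDA to
    # 40 dimensions. MLLT and SAT are omitted; the data is synthetic.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def mean_normalize_per_speaker(feats, speaker_ids):
        """Subtract each speaker's mean feature vector from that speaker's frames."""
        feats = feats.copy()
        for spk in np.unique(speaker_ids):
            mask = speaker_ids == spk
            feats[mask] -= feats[mask].mean(axis=0)
        return feats

    # Toy data: 600 frames of 90-dim features, 50 classes, 3 speakers.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(600, 90))
    labels = rng.integers(0, 50, size=600)
    speakers = rng.integers(0, 3, size=600)

    reduced = LinearDiscriminantAnalysis(n_components=40).fit_transform(
        mean_normalize_per_speaker(feats, speakers), labels)   # shape (600, 40)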
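
The fixed 15-frame length from the second highlight amounts to simple truncation and right-padding; the sketch below assumes an integer ID for the closed-mouth viseme, which is an illustrative choice.

    # Hedged sketch of the padding step: cut or pad every word's viseme
    # sequence to 15 frames using an assumed closed-mouth viseme ID.
    SEQ_LEN = 15
    CLOSED_MOUTH = 0   # assumed ID of the closed-mouth viseme

    def pad_viseme_sequence(seq, length=SEQ_LEN, pad_value=CLOSED_MOUTH):
        """Truncate or right-pad a list of viseme IDs to a fixed length."""
        seq = list(seq)[:length]
        return seq + [pad_value] * (length - len(seq))

    print(pad_viseme_sequence([3, 3, 7, 7, 12]))
    # -> [3, 3, 7, 7, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]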


Summary

INTRODUCTION

Visual Speech Recognition (VSR) is the process of extracting textual or speech data from facial features through image processing techniques. It plays a vital role in human-computer interaction; particularly in noisy environments, it complements Automatic Speech Recognition systems to improve performance [1][2]. Lip reading (LR) systems face problems due to variations in skin tone, speaking speed, pronunciation, and facial features. Speaker-dependent systems train on data from a single speaker and are suitable for speech and speaker verification applications [4]. Speaker-independent systems train on data from several speakers to generalize and are suitable for text transcription and voice-activated applications. The proposed system extracts lip features from each frame and stores them as a viseme sequence for classification.
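
The per-frame lip feature extraction can be sketched with the 68-point facial shape predictor named in the outline below; the use of dlib, the landmark model file, and landmarks 48-67 for the mouth are assumptions based on common practice, not the paper's exact configuration.

    # Minimal sketch: per-frame lip-landmark extraction with dlib.
    # The model file name and video path are placeholders.
    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def lip_landmarks_per_frame(video_path):
        """Return one list of (x, y) mouth landmarks per video frame."""
        cap = cv2.VideoCapture(video_path)
        features = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector(gray)
            if faces:
                shape = predictor(gray, faces[0])
                features.append([(shape.part(i).x, shape.part(i).y)
                                 for i in range(48, 68)])
        cap.release()
        return features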

Lip Feature Extraction in YIQ domain
Segmentation Method
Zernike Features
Deep Neural Networks
Shape Predictor
MIRACL-VC1
DESIGN AND IMPLEMENTATION
Pre-Processing
Face Tracking
Resizing
Convolutional Neural Network
RESULT
Findings
CONCLUSION