Abstract

Visual Speech Recognition aims at transcribing lip movements into readable text. There have been many strides in automatic speech recognition systems that recognize words from audio and visual speech features, even under noisy conditions. This paper focuses only on the visual features, whereas a robust system would use visual features to support acoustic features. We propose the concatenation of visemes (lip movements) for text classification rather than classic individual viseme mapping. The results show that this approach achieves a significant improvement over state-of-the-art models. The system has two modules: the first extracts lip features from the input video, while the second is a neural network trained to process the viseme sequence and classify it as text.
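
As a concrete illustration of the second module, the sketch below classifies a fixed-length viseme sequence with a small 1-D convolutional network in Keras. The viseme alphabet size, word vocabulary, and layer sizes are assumptions chosen for illustration, not the architecture reported in the paper.

    # Hypothetical sketch of the second module: map a fixed-length
    # sequence of viseme IDs to a word label. All sizes are assumed.
    import tensorflow as tf

    NUM_VISEMES = 14   # assumed size of the viseme alphabet
    SEQ_LEN = 15       # frames per word (see Highlights)
    NUM_WORDS = 10     # assumed size of the word vocabulary

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN,)),
        tf.keras.layers.Embedding(NUM_VISEMES, 16),
        tf.keras.layers.Conv1D(32, 3, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(NUM_WORDS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])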

Highlights

  • Visual Speech Recognition (VSR) is the process of extracting textual or speech data from facial features through image processing techniques

  • The visual features are extracted in the following pipeline: they are mean-normalized on a per-speaker basis, decorrelated and reduced to 40 dimensions using Linear Discriminant Analysis (LDA) and a Maximum Likelihood Linear Transform (MLLT), and Speaker Adaptive Training (SAT) is applied to normalize the variation in the acoustic features of different speakers (see the first sketch after this list)

  • Since the number of frames differs for each word due to variation in utterance duration, we fixed the number of frames to 15 and padded sequences shorter than 15 frames with a closed-mouth viseme (see the second sketch after this list)
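
The per-speaker mean normalization and the LDA reduction to 40 dimensions from the first highlight can be pictured as follows. MLLT and SAT are usually applied with a speech toolkit such as Kaldi and are not reproduced here, and the toy data shapes are assumptions.

    # Minimal sketch: per-speaker mean normalization followed by LDA to
    # 40 dimensions. MLLT and SAT are omitted; the data is synthetic.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def mean_normalize_per_speaker(feats, speaker_ids):
        """Subtract each speaker's mean feature vector from that speaker's frames."""
        feats = feats.copy()
        for spk in np.unique(speaker_ids):
            mask = speaker_ids == spk
            feats[mask] -= feats[mask].mean(axis=0)
        return feats

    # Toy data: 600 frames of 90-dim features, 50 classes, 3 speakers.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(600, 90))
    labels = rng.integers(0, 50, size=600)
    speakers = rng.integers(0, 3, size=600)

    reduced = LinearDiscriminantAnalysis(n_components=40).fit_transform(
        mean_normalize_per_speaker(feats, speakers), labels)   # shape (600, 40)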
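
The fixed 15-frame length from the second highlight amounts to simple truncation and right-padding; the sketch below assumes an integer ID for the closed-mouth viseme, which is an illustrative choice.

    # Hedged sketch of the padding step: cut or pad every word's viseme
    # sequence to 15 frames using an assumed closed-mouth viseme ID.
    SEQ_LEN = 15
    CLOSED_MOUTH = 0   # assumed ID of the closed-mouth viseme

    def pad_viseme_sequence(seq, length=SEQ_LEN, pad_value=CLOSED_MOUTH):
        """Truncate or right-pad a list of viseme IDs to a fixed length."""
        seq = list(seq)[:length]
        return seq + [pad_value] * (length - len(seq))

    print(pad_viseme_sequence([3, 3, 7, 7, 12]))
    # -> [3, 3, 7, 7, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]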


Summary

INTRODUCTION

Visual Speech Recognition (VSR) is the process of extracting textual or speech data from facial features through image processing techniques. It plays a vital role in human-computer interaction; particularly in noisy environments, it complements Automatic Speech Recognition systems to improve performance [1][2]. Lip reading (LR) systems face problems due to variations in skin tone, speaking speed, pronunciation, and facial features. Speaker-dependent systems train on data from a single speaker and are suitable for speech and speaker verification applications [4]. Speaker-independent systems train on data from several speakers to generalize and are suitable for text transcription and voice-activated applications. The proposed system extracts lip features from each frame and stores them as a viseme sequence for classification.
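
The per-frame lip feature extraction can be sketched with the 68-point facial shape predictor named in the outline below; the use of dlib, the landmark model file, and landmarks 48-67 for the mouth are assumptions based on common practice, not the paper's exact configuration.

    # Minimal sketch: per-frame lip-landmark extraction with dlib.
    # The model file name and video path are placeholders.
    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def lip_landmarks_per_frame(video_path):
        """Return one list of (x, y) mouth landmarks per video frame."""
        cap = cv2.VideoCapture(video_path)
        features = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector(gray)
            if faces:
                shape = predictor(gray, faces[0])
                features.append([(shape.part(i).x, shape.part(i).y)
                                 for i in range(48, 68)])
        cap.release()
        return features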

Lip Feature Extraction in YIQ domain
Segmentation Method
Zernike Features
Deep Neural Networks
Shape Predictor
MIRACL-VC1
DESIGN AND IMPLEMENTATION
Pre-Processing
Face Tracking
Resizing
Convolutional Neural Network
RESULT
Findings
CONCLUSION