Abstract

This paper proposes an Audio-Visual Speech Recognition (AVSR) model that uses both audio and visual speech information to improve recognition accuracy in clean and noisy environments. Mel-frequency cepstral coefficients (MFCC) and the Discrete Cosine Transform (DCT) are used to extract effective features from the audio and visual speech signals, respectively. Classification is performed on the combined feature vector using one of the main Deep Neural Network (DNN) architectures, the Bidirectional Long Short-Term Memory (BiLSTM), in contrast to traditional Hidden Markov Models (HMMs). The effectiveness of the proposed model is demonstrated on GRID, a multi-speaker AVSR benchmark dataset. The experimental results show that early integration of audio and visual features yields a clear improvement in recognition accuracy, and that BiLSTM is a more effective classification technique than HMM. With integrated audio-visual features, the model achieved a highest recognition accuracy of 99.07% on clean data, an improvement of up to 9.28% over audio-only recognition; on noisy data, the highest recognition accuracy with integrated audio-visual features is 98.47%, an improvement of up to 12.05% over audio-only. The main reason for the effectiveness of BiLSTM is that it takes the sequential characteristics of the speech signal into account. The obtained results improve on the highest audio-visual recognition accuracies previously reported on GRID and demonstrate the robustness of our AVSR model (BiLSTM-AVSR).
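
To make the early-integration step concrete, the following is a minimal sketch of MFCC and DCT feature extraction followed by frame-level concatenation. The library choices (librosa for MFCC, scipy for the DCT) and the alignment strategy are our illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of MFCC + DCT feature extraction with early integration.
# librosa/scipy are illustrative choices, not the paper's implementation.
import numpy as np
import librosa
from scipy.fft import dctn

def audio_features(wav_path, use_deltas=False):
    """13-D MFCCs per frame; appending deltas gives the 39-D variant."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # (13, T)
    if use_deltas:
        mfcc = np.vstack([mfcc,
                          librosa.feature.delta(mfcc),           # velocity
                          librosa.feature.delta(mfcc, order=2)]) # acceleration
    return mfcc.T                                                # (T, 13 or 39)

def visual_features(mouth_rois):
    """13 low-frequency 2-D DCT coefficients per grayscale mouth ROI.
    (A zig-zag scan is common; a top-left block is used here for brevity.)"""
    return np.array([dctn(roi, norm='ortho')[:4, :4].ravel()[:13]
                     for roi in mouth_rois])                     # (frames, 13)

def early_integration(af, vf):
    """Truncate to the shorter stream and concatenate per frame."""
    T = min(len(af), len(vf))
    return np.hstack([af[:T], vf[:T]])                           # (T, 26 or 52)
```

In practice the visual frame rate (25 fps for GRID video) is lower than the MFCC frame rate, so the visual stream would be interpolated or repeated to align the two streams before concatenation; simple truncation is used above only to keep the sketch short.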

Highlights

  • Human speech understanding uses both audio and visual information, e.g. the movements of the speaker's lips and tongue; using lip movements to identify the spoken words is known as lipreading

  • This paper proposes an audio-visual speech recognition system that extracts visual features in addition to the acoustic features of the speech signal; classification is performed using one of the major Deep Neural Network (DNN) architectures, Bidirectional Long Short-Term Memory (BiLSTM), with Hidden Markov Models (HMMs) applied in parallel for comparison of the obtained results

  • For each speaker, 75% of the data is used for training and 25% for testing; audio features are extracted using Mel-frequency cepstral coefficients (MFCC) with a feature vector of size 13 or 39, visual features are extracted using the Discrete Cosine Transform (DCT) with a feature vector of size 13, and audio-visual features are obtained by concatenating both feature vectors (a classifier sketch follows this list)
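
The following is a minimal sketch of a BiLSTM classifier applied to the concatenated audio-visual sequences with the 75%/25% split described above. Layer sizes, hyperparameters, and the use of Keras are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal BiLSTM sketch over concatenated audio-visual features (Keras).
# Layer sizes and hyperparameters are illustrative, not the paper's setup.
import numpy as np
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split

def build_bilstm(timesteps, feat_dim, num_classes):
    model = models.Sequential([
        layers.Input(shape=(timesteps, feat_dim)),  # feat_dim = 13+13 = 26, or 39+13 = 52
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),      # reads the sequence in both directions
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Placeholder data: X holds padded audio-visual feature sequences,
# y integer word labels (GRID's vocabulary has 51 words).
X = np.random.rand(200, 75, 26).astype('float32')
y = np.random.randint(0, 51, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)  # 75%/25% split
model = build_bilstm(timesteps=75, feat_dim=26, num_classes=51)
model.fit(X_tr, y_tr, epochs=10, validation_data=(X_te, y_te))
```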


Introduction

Human speech understanding uses both audio and visual information, e.g. the movements of the speaker's lips and tongue; using lip movements to identify the spoken words is known as lipreading. Visual feature extraction methods can be classified into three classes:

1) "Appearance or pixel based" methods, which operate on a pre-defined region of interest (ROI) around the lips and assume that the whole lip region is informative for speech recognition. They rely on traditional image compression techniques, e.g. the Discrete Cosine Transform (DCT) [2], the Discrete Wavelet Transform (DWT), Principal Component Analysis (PCA) [5], and Linear Discriminant Analysis (LDA) [6]. Among these methods, DCT has been shown to perform as well as or better than the others [7] (a sketch of this approach follows below). Although appearance-based features are preferred because they do not require restrictive lip-shape models or hand-labeled training data, they are vulnerable to changes in lighting conditions and to translations or rotations of the input images; deep learning can be used to overcome these weaknesses [8].

2) "Shape or lip-contour based" methods, where a prior template or model is used to describe the mouth area. These methods suffer from information loss [6] because they use only the width and height of the speaker's lips rather than the whole region. An example is the system introduced by Chowdhary [9], where scale-invariant feature extraction and shape-index depiction are used to build a robust object recognition system.

3) The combination of 1) and 2), which uses the width and height in addition to the pixel values of the ROI. An example is the system introduced by Chan [10], which proposed a visual feature representation combining geometric and pixel-based features to perform visual-only and audio-visual speech recognition.
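
As an illustration of class 1), the following sketch extracts appearance-based features by taking the 2-D DCT of a grayscale mouth ROI and reading the coefficients in zig-zag order, so that the most informative low-frequency terms come first. The numpy/scipy code is our assumption, not code from the cited systems.

```python
# Appearance-based visual features: 2-D DCT of the mouth ROI with a
# zig-zag scan (illustrative sketch, not the cited systems' code).
import numpy as np
from scipy.fft import dctn

def zigzag_order(h, w):
    """Grid positions sorted by anti-diagonal, alternating direction."""
    return sorted(((r, c) for r in range(h) for c in range(w)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 else rc[0]))

def dct_roi_features(roi, n_coeffs=13):
    """Keep the first n_coeffs DCT coefficients in zig-zag order."""
    coeffs = dctn(roi.astype(np.float64), norm='ortho')
    return np.array([coeffs[r, c]
                     for r, c in zigzag_order(*coeffs.shape)[:n_coeffs]])

# Usage: a grayscale mouth crop, e.g. 32x48 pixels
roi = np.random.rand(32, 48)
print(dct_roi_features(roi))  # 13 low-frequency coefficients
```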

