Abstract

This paper proposes an Audio-Visual Speech Recognition (AVSR) model that uses both audio and visual speech information to improve recognition accuracy in clean and noisy environments. Mel-Frequency Cepstral Coefficients (MFCC) and the Discrete Cosine Transform (DCT) are used to extract effective features from the audio and visual speech signals, respectively. Classification is performed on the combined feature vector using one of the main Deep Neural Network (DNN) architectures, the Bidirectional Long Short-Term Memory (BiLSTM) network, in contrast to traditional Hidden Markov Models (HMMs). The effectiveness of the proposed model is demonstrated on a multi-speaker AVSR benchmark dataset named GRID. The experimental results show that early integration of audio and visual features achieves a clear enhancement in recognition accuracy, and that BiLSTM is a more effective classification technique than HMM. The integrated audio-visual features achieve a highest recognition accuracy of 99.07% on clean data, an enhancement of up to 9.28% over audio-only recognition. On noisy data, the highest recognition accuracy for integrated audio-visual features is 98.47%, an enhancement of up to 12.05% over audio-only. The main reason for BiLSTM's effectiveness is that it takes into account the sequential characteristics of the speech signal. The obtained results show a performance enhancement over the highest previously reported audio-visual recognition accuracies on GRID and prove the robustness of our AVSR model (BiLSTM-AVSR).
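The paper does not include an implementation here, so the following is a minimal PyTorch sketch of the classification stage only: a BiLSTM over sequences of concatenated audio-visual feature vectors. The input dimension (39 MFCC + 13 DCT = 52), the hidden size, the sequence length, and the class count (GRID's 51-word vocabulary) are illustrative assumptions, not hyperparameters stated in this abstract.

```python
import torch
import torch.nn as nn

class BiLSTMAVSR(nn.Module):
    """Sketch of a BiLSTM classifier over audio-visual feature sequences."""

    def __init__(self, input_dim=52, hidden_dim=128, num_classes=51):
        super().__init__()
        # A bidirectional LSTM reads the frame sequence forwards and
        # backwards, capturing the sequential structure of the speech signal.
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated: 2 * hidden_dim.
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):              # x: (batch, frames, input_dim)
        out, _ = self.bilstm(x)        # out: (batch, frames, 2 * hidden_dim)
        return self.fc(out[:, -1, :])  # classify from the final time step

model = BiLSTMAVSR()
dummy = torch.randn(8, 75, 52)   # 8 utterances, 75 frames (3 s at 25 fps)
logits = model(dummy)            # (8, 51)
```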

Highlights

  • Human speech understanding uses both audio and visual information, e.g. the movements of the speaker's lips and tongue; using lip movements to identify spoken words is known as lipreading

  • This paper proposes an audio-visual speech recognition system that extracts visual features in addition to acoustic features from the speech signal; classification is performed using one of the major Deep Neural Network (DNN) architectures, the Bidirectional Long Short-Term Memory (BiLSTM) network, with Hidden Markov Models (HMMs) applied in parallel to compare the obtained results

  • For each speaker, 75% of the data is used for training and 25% for testing; audio features are extracted using Mel-Frequency Cepstral Coefficients (MFCC) with a feature vector of size 13 or 39, the Discrete Cosine Transform (DCT) extracts visual features with a feature vector of size 13, and the audio-visual features are obtained by concatenating both feature vectors (see the sketch after this list)
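As a minimal sketch of the audio side and the early-integration step, assuming librosa for MFCC extraction; the delta computation for the 39-dimensional variant, the audio-video frame alignment, and the helper names are illustrative assumptions:

```python
import numpy as np
import librosa

def audio_features(wav_path, use_deltas=False):
    """Per-frame MFCCs: 13-dim, or 39-dim with deltas and delta-deltas."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # (13, frames)
    if use_deltas:
        mfcc = np.vstack([mfcc,
                          librosa.feature.delta(mfcc),
                          librosa.feature.delta(mfcc, order=2)])  # (39, frames)
    return mfcc.T  # (frames, 13) or (frames, 39)

# Early integration: concatenate per-frame audio and visual vectors
# (assumes audio frames are time-aligned with the 25 fps video frames).
# av_frame = np.hstack([audio_frame, visual_frame])  # e.g. 39 + 13 = 52 dims

# Per-speaker 75/25 split over that speaker's utterance list (illustrative):
# cut = int(0.75 * len(utterances))
# train, test = utterances[:cut], utterances[cut:]
```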



Introduction

Human speech understanding uses both audio and visual information, e.g. the movements of the speaker's lips and tongue; using lip movements to identify spoken words is known as lipreading. Visual feature extraction methods can be classified into three classes:

1) "Appearance or pixel based" methods, which operate on a pre-defined region of interest (ROI) around the lips and assume that the whole lip region is informative for speech recognition. They rely on traditional image-compression techniques, e.g. the Discrete Cosine Transform (DCT) [2], the Discrete Wavelet Transform (DWT), Principal Component Analysis (PCA) [5], and Linear Discriminant Analysis (LDA) [6]. Among these methods, DCT has been shown to perform as well as or better than the others [7]. Although appearance-based features are preferred because they need neither restricted lip-shape models nor hand-labeled training data, they are vulnerable to changes in lighting conditions and to translations or rotations of the input images; deep learning can be used to overcome these weaknesses [8].

2) "Shape or lip contour based" methods, where a prior template or model describes the mouth area. They suffer information loss [6] because they use only the width and height of the speaker's lips rather than the whole region. An example is the system introduced by Chowdhary [9], where scale-invariant feature extraction and a shape-index depiction method form a robust object recognition system.

3) The combination of 1) and 2), which takes the width and height in addition to the pixel values of the ROI. An example is the system introduced by Chan [10], which proposed a visual feature representation combining both geometric and pixel-based features to perform visual-only and audio-visual speech recognition.
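As an illustration of the appearance-based approach in class 1), the following is a minimal sketch that extracts low-frequency 2-D DCT coefficients from a grayscale lip ROI. The zigzag scan and the coefficient count of 13 follow common practice for such features and are assumptions, not details prescribed by this passage.

```python
import numpy as np
from scipy.fftpack import dct

def dct_lip_features(lip_roi, n_coeffs=13):
    """Low-frequency 2-D DCT coefficients of a grayscale lip ROI (H x W)."""
    # Separable 2-D DCT: apply along rows, then along columns.
    coeffs = dct(dct(lip_roi.astype(float), axis=0, norm='ortho'),
                 axis=1, norm='ortho')
    # Zigzag scan of the top-left (low-frequency) corner, where most of
    # the image energy concentrates.
    h, w = coeffs.shape
    zigzag = sorted(((i, j) for i in range(h) for j in range(w)),
                    key=lambda p: (p[0] + p[1],
                                   p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([coeffs[i, j] for i, j in zigzag[:n_coeffs]])

# Example: a 32x48 grayscale mouth region -> 13-dim feature vector
roi = np.random.rand(32, 48)
print(dct_lip_features(roi).shape)   # (13,)
```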
