Abstract

This paper extends earlier work on designing a speech recognition system based on the Hidden Markov Model (HMM) classification technique that uses the visual modality in addition to the audio modality [1]. Accuracy beyond that of traditional HMM-based Automatic Speech Recognition (ASR) is achieved by applying either an RNN-based or a CNN-based approach. This research delivers two contributions. The first is a methodology for choosing the visual features: comparing visual feature extraction methods such as the Discrete Cosine Transform (DCT), blocked DCT, and Histograms of Oriented Gradients with Local Binary Patterns (HOG+LBP), and applying dimensionality reduction techniques such as Principal Component Analysis (PCA), auto-encoders, Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE) to find the most effective feature vector size. The obtained visual features are then early-integrated with audio features obtained using Mel Frequency Cepstral Coefficients (MFCCs), and the combined audio-visual feature vector is fed to the classification process. The second contribution is a methodology for developing the classification process using deep learning, comparing Deep Neural Network (DNN) architectures such as Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNNs) with the traditional HMM. The proposed model is evaluated on two multi-speaker AV-ASR datasets, AVletters and GRID, at different SNRs. The experiments are speaker-independent on the AVletters dataset and speaker-dependent on the GRID dataset.
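As a rough illustration of the early-integration pipeline described above, the following sketch extracts blocked-DCT visual features from a mouth-region clip, reduces them with PCA, extracts MFCCs from the audio, and concatenates the two streams frame by frame. All sizes (ROI resolution, DCT block, PCA dimension, 13 MFCCs) and the naive rate-alignment step are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
import librosa
from scipy.fftpack import dct
from sklearn.decomposition import PCA

def visual_features(mouth_rois, block=8):
    """Blocked 2-D DCT per grayscale mouth ROI, keeping the
    low-frequency (top-left) block as the feature vector."""
    feats = []
    for roi in mouth_rois:                               # roi: (H, W)
        coeff = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
        feats.append(coeff[:block, :block].ravel())      # 64 low-freq coeffs
    return np.stack(feats)                               # (T_video, block**2)

def audio_features(wav, sr):
    """13 MFCCs per audio frame."""
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T  # (T_audio, 13)

# Dummy data standing in for one utterance (25 video frames, 1 s of audio).
rng = np.random.default_rng(0)
rois = rng.random((25, 64, 64))
wav = rng.standard_normal(16000).astype(np.float32)

v = visual_features(rois)                                # (25, 64)
v = PCA(n_components=10).fit_transform(v)                # (25, 10) reduced
a = audio_features(wav, sr=16000)                        # (T, 13)

# Early integration: upsample video frames to the audio frame rate,
# then concatenate audio and visual features per frame.
idx = np.linspace(0, len(v) - 1, num=len(a)).round().astype(int)
av = np.concatenate([a, v[idx]], axis=1)                 # (T, 23) fused
print(av.shape)
```

In a real system the PCA (or LDA/auto-encoder) projection would be fitted on the whole training set rather than a single clip; it is fitted per clip here only to keep the sketch self-contained.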

Highlights

  • The main goal of designing a speech recognition system is to obtain a high-quality and robust model, especially in a noisy environment

  • The results are obtained by evaluating the proposed model on two well-known audio-visual speech datasets, GRID and AVletters, to show the effectiveness of the Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and Hidden Markov Model (HMM) in audio-visual Automatic Speech Recognition (AV-ASR)

  • The results show that the deep AV39DBiLSTMav AV-ASR model outperforms the other AV-ASR models with other feature types (a minimal BiLSTM sketch follows this list)
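As referenced in the last highlight, below is a minimal sketch of an utterance-level BiLSTM classifier over fused audio-visual frames, written in PyTorch. The input width of 23 matches the toy fusion sketch above, the 26 classes correspond to the letter labels of AVletters, and the layer sizes are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class AVBiLSTM(nn.Module):
    """Utterance-level BiLSTM classifier over fused AV frames
    (hypothetical layer sizes, not the paper's configuration)."""
    def __init__(self, in_dim=23, hidden=128, n_classes=26):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, T, in_dim)
        out, _ = self.lstm(x)              # (batch, T, 2*hidden)
        return self.head(out.mean(dim=1))  # pool over time -> class logits

model = AVBiLSTM()
logits = model(torch.randn(4, 32, 23))     # 4 utterances of 32 frames
print(logits.shape)                        # torch.Size([4, 26])
```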


Introduction

The main goal of designing a speech recognition system is to obtain a high-quality and robust model, especially in a noisy environment. Speech is a multimodal signal that depends on audio and visual modalities, so to build a high-quality, noise-robust speech recognition system it is important to take advantage of the different modalities of the speech signal to enhance the speech understanding process. Combining the achievements of lipreading with a traditional audio-only automatic speech recognition system is a good choice for designing a noise-robust AV-ASR system. The first step in building noise-robust speech recognition is to carefully choose a suitable method for extracting the most informative features from the audio and visual signals. Feature extraction is an important issue in designing the recognition system. Extracting features from the visual signal can be divided into geometric feature-
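Since the model is evaluated under different SNRs, a small helper for constructing such noisy test conditions is sketched below. The scaling formula is the standard one, but the function and signal names are illustrative; this is not the authors' evaluation code.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals
    `snr_db`, then return the noisy mixture."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12          # avoid divide-by-zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)   # stand-in for a clean utterance
babble = rng.standard_normal(16000)   # stand-in for additive noise
noisy = mix_at_snr(speech, babble, snr_db=10)
```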
