Abstract

This paper implements a continuous Hindi Automatic Speech Recognition (ASR) system using the proposed integrated feature vector with Recurrent Neural Network (RNN) based Language Modeling (LM). The proposed system also implements speaker adaptation using Maximum-Likelihood Linear Regression (MLLR) and Constrained Maximum-Likelihood Linear Regression (C-MLLR). The system is discriminatively trained by Maximum Mutual Information (MMI) and Minimum Phone Error (MPE) techniques with 256 Gaussian mixtures per Hidden Markov Model (HMM) state. The baseline system has been trained on a phonetically rich Hindi dataset. The results show that discriminative training enhances the baseline system performance by up to 3%. A further improvement of ~7% has been recorded by applying the RNN LM. The proposed Hindi ASR system shows significant performance improvement over other current state-of-the-art techniques.

Highlights

  • Automatic Speech Recognition (ASR) is the process of taking a speech utterance and converting it into the closest possible text sequence

  • We found Gammatone Frequency Cepstral Coefficients (GFCC) features to be more robust than Mel Frequency Cepstral Coefficient (MFCC) and Perceptual Linear Predictive Analysis (PLP) features [15]

  • The results clearly show that the combination of MFCC+GFCC+Wavelet packet based ERB Cepstral features (WERBC) with Heteroscedastic Linear Discriminant Analysis (HLDA) transformation outperforms all other feature combinations
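The integrated feature set above is built by concatenating the per-frame MFCC, GFCC, and WERBC vectors and then applying a linear HLDA projection. The sketch below illustrates only that concatenate-then-transform shape of the pipeline: the feature streams are random placeholders, and since HLDA itself is estimated by EM over class statistics, a simple PCA projection is used here as a stand-in for the learned transform.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100  # number of frames (illustrative)
mfcc_f = rng.standard_normal((T, 13))   # placeholder MFCC stream
gfcc_f = rng.standard_normal((T, 13))   # placeholder GFCC stream
werbc_f = rng.standard_normal((T, 13))  # placeholder WERBC stream

# Frame-level concatenation: a 39-dimensional integrated feature vector
integrated = np.concatenate([mfcc_f, gfcc_f, werbc_f], axis=1)

# HLDA is estimated by EM on per-class statistics; as a simplified
# stand-in, project onto the top-k eigenvectors of the global
# covariance (i.e. PCA) to show the dimensionality-reducing transform.
cov = np.cov(integrated, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
proj = evecs[:, np.argsort(evals)[::-1][:20]]  # keep 20 dims (illustrative)
reduced = integrated @ proj  # shape (T, 20)
```

The dimensionalities (13 per stream, 20 after projection) are illustrative choices, not values taken from the paper.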


Summary

Introduction

ASR is the process of taking a speech utterance and converting it into the closest possible text sequence. Various techniques are available to extract speech features, such as Mel Frequency Cepstral Coefficient (MFCC) [12], Perceptual Linear Predictive Analysis (PLP) [20], Gammatone Frequency Cepstral Coefficients (GFCC) [43, 44], Linear Prediction Cepstral Coefficients (LPCC) [49], and wavelet-based feature extraction techniques [45]. Among these, MFCC is the most popular because it shows promising results in clean environment conditions, but its performance deteriorates in noisy environmental conditions. The MPE and MMI discriminative techniques were used to train the acoustic model, which gave a significant performance gain. Integrated acoustic features significantly improve accuracy over traditional features; the proposed system discriminatively trains the integrated feature vector using the MMI and MPE techniques. The remaining part of the paper is organized as follows: Section 2 explains the concept of different feature extraction techniques, speaker adaptation, discriminative techniques, and RNN LM.
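As a concrete illustration of the most common of the feature extraction techniques listed above, the following is a minimal numpy-only MFCC sketch (pre-emphasis, framing, Hamming windowing, power spectrum, mel filterbank, log, DCT-II). The frame sizes, filter counts, and coefficient count are conventional textbook defaults, not the paper's exact configuration.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=0.025, frame_step=0.010,
         n_mels=26, n_ceps=13):
    """Minimal MFCC sketch (illustrative defaults, not the paper's setup)."""
    # Pre-emphasis boosts high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window
    flen, fstep = int(sr * frame_len), int(sr * frame_step)
    n_frames = 1 + max(0, (len(sig) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)
    # Per-frame power spectrum
    pspec = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # Triangular mel-spaced filterbank
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT-II to decorrelate;
    # keep the first n_ceps cepstral coefficients
    logE = np.log(pspec @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logE @ dct.T

# Usage: one second of a 440 Hz tone at 16 kHz
feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
# feats has shape (98, 13): 98 frames, 13 cepstral coefficients each
```

GFCC follows the same framing-and-cepstrum pattern but replaces the mel filterbank with a gammatone filterbank on the ERB scale, which is the source of its noise robustness noted above.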

Feature Extraction
Discriminative techniques
Proposed Architecture
Proposed integrated feature set
Discriminative training
Hindi Speech Corpus
Simulation details and experiment results
Performance analysis of multiple feature combination
System combination
Performance evaluation of different systems
Experiment with language modeling
Findings
Conclusion