Abstract
This paper implements a continuous Hindi Automatic Speech Recognition (ASR) system using the proposed integrated feature vector with Recurrent Neural Network (RNN) based Language Modeling (LM). The proposed system also implements speaker adaptation using Maximum-Likelihood Linear Regression (MLLR) and Constrained Maximum-Likelihood Linear Regression (C-MLLR). The system is discriminatively trained with the Maximum Mutual Information (MMI) and Minimum Phone Error (MPE) techniques, using 256 Gaussian mixtures per Hidden Markov Model (HMM) state. The baseline system has been trained on a phonetically rich Hindi dataset. The results show that discriminative training improves baseline performance by up to 3%, and a further improvement of ~7% has been recorded by applying the RNN LM. The proposed Hindi ASR system shows significant performance improvement over other current state-of-the-art techniques.
Highlights
Automatic Speech Recognition (ASR) is the process of taking a speech utterance and converting it into the corresponding text sequence as accurately as possible
We found Gammatone Frequency Cepstral Coefficients (GFCC) features to be more robust than Mel Frequency Cepstral Coefficient (MFCC) and Perceptual Linear Predictive Analysis (PLP) features [15]
The results clearly show that the combination of MFCC + GFCC + Wavelet packet based ERB Cepstral (WERBC) features with Heteroscedastic Linear Discriminant Analysis (HLDA) transformation outperforms all other feature combinations
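The integrated feature vector described above is formed by combining several per-frame feature streams. A minimal sketch of that frame-wise combination, assuming random stand-in arrays in place of real MFCC/GFCC/WERBC features (the dimensions and names are illustrative, not the paper's configuration):

```python
import numpy as np

# Hypothetical per-frame feature streams; random stand-ins for illustration only.
n_frames = 100
mfcc_feats = np.random.randn(n_frames, 13)   # stand-in MFCC features
gfcc_feats = np.random.randn(n_frames, 13)   # stand-in GFCC features
werbc_feats = np.random.randn(n_frames, 13)  # stand-in WERBC features

# Frame-wise concatenation yields one integrated feature vector per frame.
integrated = np.concatenate([mfcc_feats, gfcc_feats, werbc_feats], axis=1)
print(integrated.shape)  # (100, 39)
```

In the paper the concatenated vector is then projected with an HLDA transformation, which is a learned decorrelating/dimension-reducing linear map; that estimation step is not shown here.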
Summary
ASR is the process of taking a speech utterance and converting it into the corresponding text sequence as accurately as possible. Several techniques are available to extract speech features, such as Mel Frequency Cepstral Coefficients (MFCC) [12], Perceptual Linear Predictive Analysis (PLP) [20], Gammatone Frequency Cepstral Coefficients (GFCC) [43, 44], Linear Prediction Cepstral Coefficients (LPCC) [49], and wavelet-based feature extraction techniques [45]. Among these, MFCC is the most popular because it shows promising results in clean environmental conditions, but its performance deteriorates in noisy conditions. The MPE and MMI discriminative techniques were used to train the acoustic model, which gave a significant performance gain. The integrated acoustic features significantly improve accuracy over traditional features; the system discriminatively trains the integrated feature vector using the MMI and MPE techniques. The remainder of the paper is organized as follows: Section 2 explains the concepts of the different feature extraction techniques, speaker adaptation, discriminative training, and the RNN LM.
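To make the MFCC pipeline mentioned above concrete, here is a minimal, self-contained sketch using only numpy: framing with a Hamming window, power spectrum, triangular mel filterbank, log compression, and a DCT-II. The window length, hop size, and filter count are common textbook defaults, not the paper's actual configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_filters=26, n_ceps=13):
    # Frame the signal with a Hamming window.
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies, log-compressed.
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_energy = np.log(power @ fb.T + 1e-10)
    # DCT-II decorrelates the log energies; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energy @ basis.T

# Usage on one second of a synthetic 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440 * t), sr=sr)
print(feats.shape)  # (n_frames, 13)
```

GFCC replaces the mel filterbank with a gammatone filterbank on the ERB scale, and WERBC derives the subbands from a wavelet packet decomposition; the rest of the cepstral pipeline is analogous.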