Abstract

A statistically designed Automatic Speech Recognition (ASR) system extracts features from the speech signal using a feature extraction method, links the extracted features to the expected phonetics of the hypothesis using acoustic models, and uses a language model to add prior information about the structure of the target language. For many years, Mel-Frequency Cepstral Coefficients (MFCC), n-grams, and Hidden Markov Models (HMM) have been the predominant approaches to feature extraction, language modeling, and acoustic modeling, respectively. However, the performance degradation of MFCC in noisy conditions and the inaccuracy of HMMs when handling large vocabularies have led researchers to propose more efficient methods. The proposed work uses the noise-robust Gammatone Frequency Cepstral Coefficients (GFCC) for feature extraction, trigram language modeling, and HMM-Gaussian Mixture Model (GMM) based acoustic modeling to implement a continuous Hindi language ASR system. It also applies the Differential Evolution (DE) technique to refine the GFCC features and discriminative techniques to enhance the performance of the acoustic model. The implemented system has been evaluated with different feature extraction methods, variants of n-gram language modeling, and different discriminative techniques, in clean as well as noisy conditions. First, the results reveal that DE-optimized GFCC with HMM-GMM acoustic modeling performs better than the MFCC, PLP, and MF-PLP feature extraction methods. Second, the experimental results show that Minimum Phone Error (MPE) training outperforms Maximum Mutual Information (MMI) and Maximum Likelihood Estimation (MLE), and that trigram language modeling gives more accurate results than unigram and bigram language modeling. Finally, it is concluded that the continuous Hindi language ASR system implemented using the DE-refined GFCC feature extraction method with the MPE discriminative training technique and trigram language modeling gives better accuracy in clean as well as noisy environments.
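
To make the trigram language modeling component concrete, the following is a minimal sketch of how trigram probabilities can be estimated from counts. It is an illustration only, not the system described in this work: the function names `train_trigram` and `trigram_prob`, the padding symbols `<s>` and `</s>`, and the toy corpus are assumptions, and no smoothing or back-off is applied, which a practical ASR language model would need.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories over tokenized sentences."""
    tri, hist = Counter(), Counter()
    for tokens in sentences:
        # Two start symbols give the first real word a full history.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            tri[history + (padded[i],)] += 1
            hist[history] += 1
    return tri, hist


def trigram_prob(tri, hist, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen histories."""
    h = hist[(w1, w2)]
    return tri[(w1, w2, w3)] / h if h else 0.0


# Hypothetical toy corpus, used only to exercise the functions.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
tri, hist = train_trigram(corpus)
p = trigram_prob(tri, hist, "a", "b", "c")
```

A unigram or bigram model follows the same counting pattern with a shorter history, so the longer trigram history is what lets the model condition each word on more of its context.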
