Abstract
The aim of this paper is to provide an overview of the automatic speech recognition (ASR) module in a spoken dialog system and of how it has evolved from the conventional GMM-HMM (Gaussian mixture model - hidden Markov model) architecture toward the recent nonlinear DNN-HMM (deep neural network) scheme. GMMs long served as the baseline acoustic model for speech recognition, but in recent years, with the resurgence of artificial neural networks (ANNs), they have been surpassed in most recognition tasks. A notable property of ANN-based acoustic models is that their weights can be adjusted in two training steps: i) initialization of the weights (with or without pre-training) and ii) fine-tuning. To exemplify these frameworks, a case study is conducted using the Kaldi toolkit, employing a mid-vocabulary, speaker-independent voice corpus in a connected-words phone-dialing environment, used to recognize digit strings and personal name lists in Mexican Spanish. The results show reasonable accuracy in speech recognition performance with DNN acoustic modeling. A word error rate (WER) of 1.49% is achieved for the context-dependent DNN-HMM, a 30% relative improvement over the best GMM-HMM result in these experiments (2.12% WER).
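The 30% figure quoted above is a relative error-rate reduction, not an absolute difference. As a sanity check, the computation can be sketched as follows (the helper name is illustrative, not part of the Kaldi toolkit):

```python
# Illustrative helper: relative WER reduction of a new system over a baseline.
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error-rate reduction of new_wer over baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer

# Best GMM-HMM (2.12% WER) vs. context-dependent DNN-HMM (1.49% WER):
print(f"{relative_improvement(2.12, 1.49):.1%}")  # prints "29.7%", i.e. ~30% relative
```

This confirms that the reported 1.49% WER corresponds to roughly a 30% relative improvement over the 2.12% GMM-HMM baseline.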