Abstract

Conventional automatic speech recognition (ASR) systems employ the Gaussian mixture model-hidden Markov model (GMM-HMM) for acoustic modeling and the n-gram for language modeling. Over the last decade, the deep feed-forward neural network (DFNN) has almost replaced the GMM in acoustic modeling, and current ASR systems are predominantly based on the DFNN-HMM acoustic model and the n-gram language model (LM). Owing to their better long-term context modeling ability, recurrent neural network (RNN) based LMs have been reported to yield lower perplexity than n-gram LMs. Recently, a variant of the RNN, the long short-term memory (LSTM) network, has been successfully explored for acoustic modeling. Interestingly, the evaluation of an ASR system employing RNN-based models for both acoustic and language modeling is yet to be reported. Further, we note that most of these advancements have been explored in the context of adults' ASR only. Motivated by these works, in this paper we explore LSTM-based acoustic modeling combined with an RNN-based LM for children's ASR. Our experimental results show that such combined RNN-based modeling is effective in both matched and mismatched children's ASR tasks.

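To make the two modeling components concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: a hybrid-style LSTM acoustic model that maps acoustic feature frames to posteriors over tied HMM states, and a word-level RNN LM that predicts the next word from the full running history (unlike a fixed-order n-gram). The layer sizes, feature dimension, senone count, and vocabulary size are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    """Frame-level LSTM acoustic model for a hybrid LSTM-HMM system:
    maps acoustic feature frames to logits over tied HMM states (senones)."""
    def __init__(self, feat_dim=40, hidden_dim=512, num_layers=3, num_senones=3000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.output = nn.Linear(hidden_dim, num_senones)

    def forward(self, feats):
        # feats: (batch, num_frames, feat_dim)
        hidden, _ = self.lstm(feats)
        return self.output(hidden)   # (batch, num_frames, num_senones)

class RNNLanguageModel(nn.Module):
    """Word-level RNN LM: the recurrent state carries the whole word
    history, giving longer context than a fixed-order n-gram LM."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) integer word indices
        hidden, _ = self.rnn(self.embed(word_ids))
        return self.output(hidden)   # next-word logits at every position

if __name__ == "__main__":
    am = LSTMAcousticModel()
    lm = RNNLanguageModel()
    feats = torch.randn(2, 100, 40)            # 2 utterances, 100 frames
    words = torch.randint(0, 10000, (2, 20))   # 2 sentences, 20 words
    print(am(feats).shape)   # torch.Size([2, 100, 3000])
    print(lm(words).shape)   # torch.Size([2, 20, 10000])
```

In a full system of this kind, the acoustic-model posteriors would be combined with HMM decoding to produce hypotheses, which the RNN LM then scores or rescores; those decoding stages are omitted here for brevity.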