The performance of an Automatic Speech Recognition (ASR) system deteriorates when it is used on children's speech, owing to large variations in, and mismatch of, acoustic and linguistic attributes between the spoken utterances of adults and children. Another important reason for the poor performance of ASR models is the scarcity of children's speech data for low-resource languages such as Punjabi. The work proposed in this paper addresses both challenges, namely the acoustic and linguistic variation and the data scarcity, and thereby improves the performance of a children's speech ASR system for the Punjabi language. To handle the first issue, the proposed work uses formant modification as a spectral warping technique to reduce the variation between children's speech and adult speech. To address the second issue, the paper discusses training ASR models on augmented children's speech data; the augmentation is implemented with Tacotron 2, an end-to-end Text-to-Speech (TTS) generative model. The work also combines the well-established Mel-Frequency Cepstral Coefficients (MFCC) feature extraction technique with Frequency Domain Linear Prediction (FDLP) to propose a hybrid MFCC-FDLP approach for front-end feature extraction. The continuous Punjabi ASR systems are implemented using MFCC, FDLP, and hybrid MFCC-FDLP features at the front end, a Time Delay Neural Network (TDNN) for back-end acoustic modeling, and a trigram language model. To increase the robustness of the proposed ASR system, we included a batch of lexically diverse words in our pronunciation model, which achieved a relative improvement of 29.44%.