Abstract

User applications such as voice-based web search, online learning, and video gaming require an effective speech recognition module to accept user commands. Nowadays, even children frequently use such tools, especially for online learning and gaming. This has increased the demand for noise-robust automatic speech recognition (ASR) systems that can effectively transcribe children’s speech under varied ambient conditions. However, automatic recognition of children’s speech is extremely challenging because data from child speakers are scarce in most of the world’s languages. Consequently, in this zero-resource condition, we are forced to decode children’s speech on systems trained using adults’ data. The acoustic mismatch between adults’ and children’s speech, such as differences in pitch, formant frequencies, and speaking rates, leads to severely degraded recognition performance. To enhance the recognition rate under zero-resource conditions, this paper explores formant- and duration-modification-based out-of-domain data augmentation. For that purpose, the formant frequencies of the adults’ speech data are upscaled by warping the linear predictive coding (LPC) coefficients. Pooling the original and formant-modified adult speech data for training reduces the mismatch in formant locations and leads to better recognition performance. Further improvement is obtained by simultaneously modifying the duration as well as the formant frequencies of the training data; this case of out-of-domain data augmentation is also studied in this work and found to yield added gains. In addition to data augmentation, a noise- and pitch-robust front-end acoustic feature extraction approach exploiting higher-order spectral analysis (simple and cross-bispectrum) is also proposed in this paper. The proposed features are noise-robust due to the inherent immunity of the bispectrum to additive noise. An added advantage of the bispectrum is its reduced pitch sensitivity, as demonstrated in this work, which in turn helps alleviate the aforementioned pitch-induced acoustic mismatch. The experimental evaluations presented in this paper demonstrate that the proposed acoustic features, as well as the out-of-domain data augmentation techniques, are highly suited for zero-resource children’s speech recognition under both clean and noisy conditions.
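To make the LPC-based formant up-scaling concrete, the sketch below shows one way such an augmentation step could look. It is not the authors’ implementation: the frame length (25 ms at an assumed 16 kHz sampling rate), hop size, warp factor alpha = 1.15, the function names, and the example file name are all illustrative choices, and librosa and SciPy are assumed to be available. The idea is to fit an all-pole (LPC) model per frame, scale the angles of its poles so the formant peaks move upward, and re-synthesise the frame from the LPC residual through the warped filter.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def warp_formants(frame, order=16, alpha=1.15):
    """Shift the formants of one windowed frame upward by a factor alpha
    by scaling the angles of the LPC poles (pole magnitudes are kept)."""
    a = librosa.lpc(frame, order=order)            # all-pole model, a[0] == 1
    residual = lfilter(a, [1.0], frame)            # inverse filter -> excitation
    poles = np.roots(a)
    angles = np.clip(np.angle(poles) * alpha, -np.pi, np.pi)
    warped_poles = np.abs(poles) * np.exp(1j * angles)
    a_warped = np.poly(warped_poles).real          # rebuild the warped all-pole filter
    return lfilter([1.0], a_warped, residual)      # re-synthesise the frame

def formant_augment(y, frame_len=400, hop=100, alpha=1.15):
    """Frame-wise formant warping with windowed overlap-add (16 kHz assumed)."""
    window = np.hanning(frame_len)
    out = np.zeros(len(y) + frame_len)
    norm = np.zeros_like(out)
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * window
        if np.max(np.abs(frame)) < 1e-5:           # leave near-silent frames untouched
            warped = frame
        else:
            warped = warp_formants(frame, alpha=alpha)
        out[start:start + frame_len] += warped * window
        norm[start:start + frame_len] += window ** 2
    return out[:len(y)] / np.maximum(norm[:len(y)], 1e-8)

# Hypothetical usage: the warped copies are pooled with the original adult
# recordings to form the augmented training set.
y, sr = librosa.load("adult_utterance.wav", sr=16000)  # file name is illustrative
y_formant = formant_augment(y, alpha=1.15)
```

The duration modification studied alongside formant warping could, under the same assumptions, be approximated with a time stretch of the same utterances (e.g. librosa.effects.time_stretch), although the abstract does not specify the exact modification method used in the paper.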
