Abstract

Effective child automatic speech recognition (ASR) systems have become increasingly important due to the growing use of interactive technology. Because publicly available child speech databases are scarce, young child ASR systems often rely on older child or adult speech for training data; however, there is a large acoustic mismatch between child and adult speech. This study proposes a novel fundamental frequency (fo)-based frequency warping technique for both frequency normalization and data augmentation, to combat this acoustic mismatch and address the lack of available child speech training data. The technique is inspired by the tonotopic distances between formants and fo, which were developed to model human vowel perception. The tonotopic distances are reformulated as a linear relationship between fo and vowel formants on the Mel scale. This reformulation is verified using fo and formant measurements from child utterances, and the relationship is further generalized so that the frequency warping technique relies on only two parameters. The LibriSpeech ASR corpus is used for training, and the OGI Kids’ Speech and CMU Kids corpora are used for both training and testing. A single word ASR experiment and a continuous read speech ASR experiment are performed to evaluate the fo-based frequency normalization and data augmentation techniques. In the single word experiment, the system using fo-based frequency normalization significantly improved over the baseline system with no normalization, achieving a relative improvement of up to 22.3% when the mismatch between training and testing data was large. In the continuous speech experiment, the combination of fo-based frequency normalization and data augmentation resulted in a relative improvement of 19.3% over the baseline. Additionally, in all experiments, the fo-based techniques outperformed alternatives such as vocal tract length normalization (VTLN) and vocal tract length perturbation (VTLP).
Results were validated using Gaussian mixture model (GMM), deep neural network (DNN), and bidirectional long–short term memory (BLSTM) acoustic models.
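The core idea above can be illustrated with a minimal sketch. The code below assumes the standard HTK-style Mel conversion and models the abstract's stated linear relationship between fo and formants on the Mel scale as mel(F) ≈ a·mel(fo) + b; the function names, the parameters `a` and `b`, and their default values are illustrative placeholders, not the paper's fitted model.

```python
import math

def hz_to_mel(f_hz):
    # Standard HTK-style Mel conversion: m = 2595 * log10(1 + f/700).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of the Mel conversion above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fo_based_warp(f_hz, fo_src_hz, fo_tgt_hz, a=1.0):
    """Warp a source-speaker frequency toward a target speaker by
    shifting on the Mel scale in proportion to the difference in
    mel(fo), per the assumed linear relation mel(F) ≈ a * mel(fo) + b.
    The intercept b cancels in the difference; `a` is one of the two
    free parameters and its value here is a placeholder.
    """
    delta = a * (hz_to_mel(fo_tgt_hz) - hz_to_mel(fo_src_hz))
    return mel_to_hz(hz_to_mel(f_hz) + delta)

# Hypothetical usage: shift an adult formant (fo ~ 120 Hz) toward a
# child target (fo ~ 250 Hz); the formant moves up in frequency.
warped = fo_based_warp(500.0, fo_src_hz=120.0, fo_tgt_hz=250.0)
```

Used as normalization, such a warp would map child test frequencies toward the adult training distribution; used as augmentation, the inverse direction would generate child-like variants of adult training speech.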
