Abstract

Building an automatic speech recognition (ASR) system for children is challenging, especially when domain-specific training data is absent or insufficient. In this paper, we present our efforts towards developing a children’s ASR system for Punjabi, which is a low-resource language. Since children’s speech data is unavailable for Punjabi, we first created a small speech corpus consisting of data from both adult and child speakers. Next, an ASR system was trained on a mix of adults’ and children’s speech and tested on children’s speech. Owing to differences in acoustic attributes such as formant frequencies, pitch, and speaking rate between adults’ and children’s speech, the developed ASR system yields a highly degraded recognition rate. To reduce this acoustic mismatch, we explored vocal-tract length normalization (VTLN), explicit pitch modification, and duration modification. All three approaches are observed to be highly effective. To deal with the scarcity of training data, we studied the role of prosody-modification-based out-of-domain data augmentation. For that purpose, the pitch and speaking rate of the adults’ speech training set are explicitly changed to render it similar to children’s speech. The original and prosody-modified data are then pooled together before learning the acoustic models. This augmentation is observed to reduce error rates significantly. In addition, we studied the effect of varying the number of senones, the number of hidden nodes, and the number of hidden layers, as well as early stopping, resulting in a relative improvement (RI) of 32.1% over the baseline system with varied senones.
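
For readers unfamiliar with the metric, relative improvement is commonly computed as the reduction in an error rate expressed as a percentage of the baseline error rate. The sketch below assumes RI is measured on an error rate such as word error rate; the abstract does not state the exact metric. The function name and the numbers in the example are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the relative improvement (in percent) of new_error over baseline_error.

    Assumption: both arguments are error rates in the same units (e.g. WER in %),
    where lower is better.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical illustration only: an error rate dropping from 20.0 to 15.0
example_ri = relative_improvement(20.0, 15.0)
print(f"RI = {example_ri:.1f}%")
```

With this definition, a positive value means the new system makes proportionally fewer errors than the baseline, and a negative value means it makes more.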
