Abstract

Recognizing children’s speech on automatic speech recognition (ASR) systems developed using adults’ speech is a very challenging task. As reported by several earlier works, a severely degraded recognition performance is observed in such ASR tasks. This is mainly due to the gross mismatch in the acoustic and linguistic attributes between those two groups of speakers. One among the various identified sources of mismatch is that the vocal organs of the adult and child speakers are of significantly different dimensions. Feature-space normalization techniques are noted to effectively address the ill-effects arising from those differences. Two most commonly used approaches are the vocal-tract length normalization and the feature-space maximum-likelihood linear regression. Another important mismatch factor is the large variation in the average pitch values across the adult and child speakers. Addressing the ill-effects introduced by the pitch differences is the primary focus of the presented study. In this regard, we have explored the feasibility of explicitly changing the pitch of the children’s speech so that observed pitch differences between the two groups of speaker are reduced. In general, speech data from children is high-pitched in comparison with that from the adults’. Consequently, in this study, the pitch of the adults’ speech used for training the ASR system is kept unchanged while that for the children’s test speech data is reduced. Significant improvement in the recognition performance is noted by this explicit reduction of pitch. To conserve the critical spectral information and to avoid introducing perceptual artifacts, we have exploited timescale modification techniques for explicit pitch mapping. Furthermore, we also presented two schemes to automatically determine the factor by which the pitch of the given test data should be varied. Automatically determining the compensation factor is critical since an ASR system is expected to be accessed by both adult and child speakers. The effectiveness of proposed techniques is evaluated on adult data trained ASR systems employing different acoustic modeling approaches, viz. Gaussian mixture modeling (GMM), subspace GMM and deep neural networks (DNN). The proposed techniques are found to be highly effective in all the explored modeling paradigms. To further study the effectiveness of the proposed approaches, another DNN-based ASR system is developed on a mix of speech data from adult as well as child speakers. The use of pitch reduction is observed to be effective even in this case.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call