Abstract

Current automatic speech recognition (ASR) systems perform poorly in recognizing children’s speech in noisy environments. Recognizers are typically trained with clean adults’ speech, so there are two mismatches between the training and testing phases: clean speech in training vs. noisy speech in testing, and adult speech in training vs. child speech in testing. This article studies methods to tackle the effects of these two mismatches in recognition of noisy children’s speech by investigating two techniques: data augmentation and time-scale modification. In the former, clean training data of adult speakers are corrupted with additive noise in order to obtain training data that better correspond to the noisy testing conditions. In the latter, the fundamental frequency (F0) and speaking rate of children’s speech are modified in the testing phase in order to reduce differences in the prosodic characteristics between the testing data of child speakers and the training data of adult speakers. A standard ASR system based on a deep neural network–hidden Markov model (DNN–HMM) was built, and the effects of data augmentation, F0 modification, and speaking rate modification on word error rate (WER) were evaluated first separately and then with all three techniques combined. The experiments were conducted using children’s speech corrupted with additive noise of four different noise types in four different signal-to-noise ratio (SNR) categories. The results show that the combination of all three techniques yielded the best ASR performance. As an example, the WER averaged over all four noise types in the SNR category of 5 dB dropped from 32.30% to 12.09% when the baseline system, in which neither data augmentation nor time-scale modification was used, was replaced with a recognizer that was built using a combination of all three techniques.
In summary, in recognizing noisy children’s speech with ASR systems trained with clean adult speech, considerable improvements in the recognition performance can be achieved by combining data augmentation based on noise addition in the system training phase and time-scale modification based on modifying F0 and speaking rate of children’s speech in the testing phase.
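
To make the reported metric concrete, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative assumptions and are not taken from the article, and the article's own scoring tool may differ in details such as text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using word-level Levenshtein distance.

    Illustrative helper (not from the article): WER = 100 * (S + D + I) / N,
    where N is the number of words in the reference transcript.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction in percent (illustrative helper)."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer
```

With the figures quoted above, `relative_reduction(32.30, 12.09)` gives a relative WER reduction of about 62.6% at 5 dB SNR.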

Highlights

  • Automatic speech recognition (ASR) has many potential applications for children in areas such as education, games, and entertainment

  • The results indicated, as expected, that the system performance deteriorated severely: in the noise condition with signal-to-noise ratio (SNR) = 5 dB, for example, the word error rate (WER) value rose to 82.67%, 87.40%, 92.32%, and 46.12% for babble, white, factory, and volvo noise, respectively

  • In the following sub-sections, we report on the results from the experiments, which were conducted to improve the system performance step by step: first using data augmentation, then time-scale modification based on F0 modification, then time-scale modification based on speaking rate modification, and finally all three methods combined


Summary

Introduction

Automatic speech recognition (ASR) has many potential applications for children in areas such as education (learning new languages and other skills), games, and entertainment. Children typically use ASR applications in noisy environments, and collecting data that cover different noise conditions is difficult. The performance of ASR systems in recognition of children’s speech degrades for two reasons: the mismatch caused by training and testing under different noise conditions, and the mismatch caused by training the system with adults’ speech and testing it with children’s speech. While the majority of publicly available ASR systems work effectively for adults’ speech in noise-free environments, their performance degrades considerably when they are used in noisy environments and when they recognize children’s speech in noisy environments [1,2]. To develop a system that is robust to both mismatches, the current study takes advantage of two techniques: (1) data augmentation based on noise addition, to tackle the mismatch induced by having different noise conditions in training and testing, and (2) time-scale modification, to tackle the mismatch induced by having adults’ speech in training and children’s speech in testing.
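
As an illustration of data augmentation based on noise addition, the sketch below mixes a noise signal into a clean utterance at a chosen SNR. It assumes signals are plain lists of float samples at the same sampling rate; the function names `mean_power` and `add_noise_at_snr` are hypothetical, and the article's actual augmentation pipeline may differ, for example in how noise segments are selected.

```python
import math


def mean_power(signal):
    """Average power (mean of squared samples) of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the mixture has the target SNR.

    Illustrative helper (not from the article). The noise is repeated or
    truncated to the speech length, then scaled by a gain g chosen so that
    10 * log10(P_speech / (g**2 * P_noise)) == snr_db.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Tile the noise so that it covers the whole utterance, then truncate.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (list(noise) * repeats)[:len(speech)]

    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise signal has zero power")

    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling `add_noise_at_snr(clean_utterance, noise_recording, 5.0)` on each clean training utterance would yield a noisy copy at 5 dB SNR, one of the SNR categories considered in the study. Lower `snr_db` values produce a larger noise gain and therefore a noisier mixture.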

