Abstract
This work explores the effect of mismatches between adults' and children's speech, arising from differences in various acoustic correlates, on automatic speech recognition performance under mismatched conditions. The correlates studied in this work are the pitch, the speaking rate, the glottal parameters (open quotient, return quotient, and speed quotient), and the formant frequencies. We quantify the effect of each correlate by explicitly normalizing it using existing techniques from the literature. Our initial study, conducted on a connected digit recognition task, shows that among these parameters only the formant frequencies, the pitch, and the speaking rate affect automatic speech recognition performance, and significant improvements are obtained by normalizing these three parameters. With combined normalization of the pitch, the speaking rate, and the formant frequencies, relative improvements of 80% and 70% over the baseline are obtained for children's and adults' speech recognition, respectively, under mismatched conditions.
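Formant-frequency normalization of the kind described above is commonly realized as a linear warping of the spectral frequency axis (as in vocal tract length normalization). A minimal sketch, assuming a simple linear warp applied to a magnitude spectrum; the function name and the warping factor `alpha` are illustrative, not the paper's exact method:

```python
import numpy as np

def warp_spectrum(mag, alpha):
    """Linearly warp the frequency axis of a magnitude spectrum.

    alpha > 1 moves spectral content (e.g. formant peaks) to higher
    bins, alpha < 1 to lower bins; positions outside the original
    range are clamped to the edge values by np.interp.
    """
    bins = np.arange(len(mag))
    # warped[k] takes the value of the original spectrum at bin k/alpha
    return np.interp(bins / alpha, bins, mag)
```

For example, a single spectral peak at bin 100 moves to bin 200 under `alpha = 2.0` and to bin 50 under `alpha = 0.5`, which is the sense in which such a warp can shift children's higher formants toward adult-like positions (or vice versa).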
Highlights
In recent years, the development of speech recognition systems has expanded the use of machines and other interactive multimedia systems in diverse areas [1]
We describe the use of the pitch-synchronous time-scaling (PSTS) method for transforming the average pitch, the signal duration, and the glottal parameters (open quotient (OQ), return quotient (RQ), and speed quotient (SQ)) of speech signals
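Pitch-synchronous modification methods of this family operate on windowed segments centered at pitch marks and overlap-add them at modified spacings. The sketch below is a generic TD-PSOLA-style pitch modification, not the paper's exact PSTS implementation; the pitch marks are assumed to be precomputed (e.g. at glottal closure instants), and edge handling is deliberately simplistic:

```python
import numpy as np

def td_psola(x, marks, pitch_factor):
    """Generic TD-PSOLA-style sketch: scale the average pitch of x by
    pitch_factor while roughly preserving duration, given analysis
    pitch marks (sample indices in voiced regions).

    Windows that would fall outside the signal are simply skipped.
    """
    marks = np.asarray(marks, dtype=int)
    periods = np.diff(marks)                   # local pitch periods
    out = np.zeros(len(x))
    t = float(marks[0])
    while t < marks[-1]:
        s = int(round(t))                      # synthesis mark
        i = int(np.argmin(np.abs(marks - s)))  # nearest analysis mark
        p = int(periods[min(i, len(periods) - 1)])
        lo, hi = marks[i] - p, marks[i] + p    # two-period segment
        if lo >= 0 and hi <= len(x) and s - p >= 0 and s + p <= len(out):
            # overlap-add a Hann-windowed two-period segment at the
            # synthesis mark
            out[s - p:s + p] += x[lo:hi] * np.hanning(hi - lo)
        # denser synthesis marks raise the pitch, sparser ones lower it
        t += p / pitch_factor
    return out
```

Because the analysis windows keep their original length, the local spectral envelope (and hence the formant positions) is largely preserved while the fundamental frequency changes, which is why such methods suit pitch normalization without disturbing the other correlates.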
This section describes our experiments studying the effect of various acoustic correlates on the mismatch between adults' and children's speech, and on recognition performance under mismatched conditions
Summary
The development of speech recognition systems has expanded the use of machines and other interactive multimedia systems in diverse areas [1]. Owing to the anatomical and physiological changes that occur during a child's growth [8], children's speech exhibits a wider range of values, with different means and variances, for acoustic parameters such as pitch and formant frequencies than adults' speech, resulting in high inter- and intra-speaker acoustic variability. Together, these differences degrade the recognition performance of children's speech on models trained with adults' speech and vice versa [9, 10]
Source: EURASIP Journal on Audio, Speech, and Music Processing