Differences in the acoustic characteristics of children’s and adults’ speech degrade the performance of automatic speech recognition (ASR) systems when systems trained on adults’ speech are used to recognize children’s speech. This performance degradation is caused by the acoustic mismatch between training and testing. One of the main sources of this mismatch is the difference in vocal tract resonances (formant frequencies) between adult and child speakers. The present study aims to reduce the mismatch in formant frequencies by modifying the formants of children’s speech to correspond more closely to those of adults’ speech. This is carried out by warping the linear prediction (LP) spectrum computed from children’s speech. The warped LP spectra, computed in a frame-based manner from children’s speech, are used together with the corresponding LP residuals to synthesize speech whose formant structure is closer to that of adults’ speech. When used in testing an ASR system trained on adults’ speech, the warping reduces the spectral mismatch between training and testing and improves the system’s performance in recognizing children’s speech. Experiments were conducted using narrowband (8 kHz) and wideband (16 kHz) speech of adult and child speakers from the WSJCAM0 and PF_STAR databases, respectively, recognizing children’s speech with acoustic models trained on adults’ speech. The proposed method gave relative improvements of 24% and 11% for the DNN and TDNN acoustic models, respectively, for narrowband speech, and 27% and 13%, respectively, for wideband speech. The performance of the proposed method was also compared with two speaker adaptation methods, vocal tract length normalization (VTLN) and speaking rate adaptation (SRA); in this comparison, the proposed method gave the best recognition performance. Combining the proposed method with VTLN and SRA gave a further reduction in word error rate (WER). Moreover, experiments on noisy speech with various types of additive noise and signal-to-noise ratios showed that the proposed method also performs well for degraded speech.
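As a rough illustration of the frame-based analysis–synthesis scheme described above, the Python sketch below lowers the formants of a single speech frame by scaling the angles of the LP polynomial roots and resynthesizes the frame from the LP residual. This is a minimal sketch under stated assumptions, not the paper’s implementation: the LP order, the fixed warping factor ALPHA, and the omission of windowing, overlap-add, and per-speaker estimation of the warping factor are all simplifications introduced here for illustration.

```python
# Minimal sketch of frame-based LP spectrum warping (NOT the authors'
# exact method): formant frequencies are moved by scaling the angles of
# the LP polynomial roots; the frame is then resynthesized by filtering
# the LP residual through the warped all-pole filter.
import numpy as np
from scipy.signal import lfilter

LP_ORDER = 12   # typical LP order for 8 kHz speech (assumption)
ALPHA = 0.85    # angle scale < 1 lowers child formants toward adult range (assumption)

def lp_coefficients(frame, order):
    """Autocorrelation-method LP analysis via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12          # small floor guards all-zero (silent) frames
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1][:i]
        err *= 1.0 - k * k
    return a

def warp_formants(a, alpha):
    """Scale the angles of the complex LP roots by alpha, shifting formant
    frequencies while keeping the root radii (i.e., bandwidths) unchanged."""
    warped = []
    for z in np.roots(a):
        if abs(z.imag) < 1e-8:  # leave real roots (no formant) untouched
            warped.append(z)
        else:
            warped.append(abs(z) * np.exp(1j * np.angle(z) * alpha))
    return np.real(np.poly(warped))

def warp_frame(frame, order=LP_ORDER, alpha=ALPHA):
    """Analyze one frame, warp its LP spectrum, resynthesize from the residual."""
    a = lp_coefficients(frame, order)
    residual = lfilter(a, [1.0], frame)        # inverse filtering: e[n] = A(z) x[n]
    a_warped = warp_formants(a, alpha)
    return lfilter([1.0], a_warped, residual)  # synthesis: y[n] = e[n] / A_w(z)
```

In a complete front end, warp_frame would be applied to windowed, overlapping frames and the outputs combined by overlap-add; a per-speaker warping factor could also be estimated rather than fixed, in the spirit of VTLN factor estimation.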