Abstract

It is well known from the literature that acoustic mismatch degrades recognition performance when children's speech is decoded with acoustic models trained on adult speech. Differences in pitch and speaking rate are the two major factors behind this mismatch between the two groups of speakers. This work proposes to incorporate pitch information into an automatic speech recognition (ASR) system by exploiting the correlation between pitch and formants. Using a pitch-based spectrum normalization module in the front-end feature extraction process improves the performance of the mismatched ASR system for children of different age groups. Further, fuzzy-based time-scale modification is applied to study the effect of speaking-rate normalization on the proposed feature. On a DLSTM-based ASR system, the proposed feature yields relative improvements of $30\%$ and $33\%$ over the MFCC baseline without and with speaking-rate normalization, respectively.
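
As a reading aid for the relative improvement figures above, the following sketch spells out the usual definition. It assumes the underlying metric is word error rate (WER), which the abstract does not state; the symbols $\mathrm{WER}_{\text{base}}$ (MFCC baseline) and $\mathrm{WER}_{\text{prop}}$ (proposed feature) are introduced here for illustration only.

```latex
% Assumption: performance is measured as word error rate (WER).
% WER_base = MFCC baseline, WER_prop = proposed feature (illustrative symbols).
\[
  \text{Relative improvement}
  = \frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{prop}}}
         {\mathrm{WER}_{\text{base}}} \times 100\%
\]
% Under this definition, a 30% relative improvement corresponds to
\[
  \mathrm{WER}_{\text{prop}} = 0.70\,\mathrm{WER}_{\text{base}},
\]
% and a 33% relative improvement to
\[
  \mathrm{WER}_{\text{prop}} = 0.67\,\mathrm{WER}_{\text{base}}.
\]
```

Under this reading, the reported percentages describe a proportional reduction of the baseline error, not a difference in absolute percentage points.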

