Significance of Pitch-Based Spectral Normalization for Children's Speech Recognition

Ishwar Chandra Yadav, Gayadhar Pradhan

doi:10.1109/lsp.2019.2950763

Abstract

It is well known from the literature that, owing to several acoustic mismatches, recognition performance on children's speech deteriorates when the acoustic models are trained on adult speech. Differences in pitch and speaking rate are the two major factors that cause the acoustic mismatch between the two groups of speakers. This work proposes to incorporate pitch information into an automatic speech recognition (ASR) system by exploiting the correlation between pitch and formants. Using a pitch-based spectrum normalization module in the front-end feature extraction process improves the performance of the mismatched ASR system for children of different age groups. Further, fuzzy-based time-scale modification is applied to study the effect of speaking-rate normalization on the proposed feature. The proposed feature yields relative improvements of 30% and 33% on a DLSTM-based ASR system over the MFCC baseline without and with speaking-rate normalization, respectively.

Full Text