Improving children’s mismatched ASR using structured low-rank feature projection

S Shahnawazuddin, Hemant K Kathania, Abhishek Dey, Rohit Sinha

doi:10.1016/j.specom.2018.11.001

Abstract

The work presented in this paper explores the issues in automatic speech recognition (ASR) of children’s speech using acoustic models trained on adults’ speech. In such contexts, the large acoustic mismatch between training and test data leads to highly degraded recognition rates. Even with the use of vocal tract length normalization (VTLN), recognition performance in the mismatched case remains much below that of the matched case. Our earlier studies have shown that, for commonly used mel-filterbank-based cepstral features, the acoustic mismatch is exacerbated by insufficient smoothing of pitch harmonics for child speakers. To address this problem, this paper proposes a structured low-rank projection of the feature vectors, applied both before learning the acoustic models and before decoding. To accomplish this, a low-rank transform is first learned on the training data (adults’ speech). Any dimensionality reduction technique that depends on the variance of the training data may be used for this purpose. In this work, principal component analysis and heteroscedastic linear discriminant analysis have been explored. When the derived low-rank projection is applied in the mismatched testing case, it alleviates the pitch-dependent mismatch. For children’s mismatched ASR with acoustic modeling based on hidden Markov models (HMM) whose observation densities are modeled using Gaussian mixture models (GMM), the proposed approach provides a relative recognition performance improvement of 35% over the VTLN-included baseline. In addition, other acoustic modeling approaches based on subspace GMM (SGMM) and deep neural networks (DNN) have also been explored. Projecting the data to a lower-dimensional subspace is found to be effective in those frameworks as well. In the case of SGMM and DNN-based systems, the proposed approach results in relative recognition performance improvements of 33% and 21%, respectively, over their corresponding baselines.
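
The projection step described in the abstract (learn a variance-based low-rank transform on adults’ training features, then apply the same transform to the test features before decoding) can be illustrated with a minimal PCA sketch in Python. This is an illustration only, not the authors’ implementation: the function names, the 39-dimensional input, the rank of 13, and the random placeholder data are all assumptions made for the example, and the "structured" aspect of the proposed projection is not modeled here.

```python
import numpy as np

def learn_low_rank_projection(train_feats, rank):
    """Learn a PCA basis from training features (rows are frames).

    Hypothetical helper for illustration; returns the training mean and
    the `rank` leading eigenvectors of the training covariance matrix.
    """
    mean = train_feats.mean(axis=0)
    centered = train_feats - mean
    cov = centered.T @ centered / (len(train_feats) - 1)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:rank]
    return mean, eigvecs[:, order]

def project(feats, mean, basis):
    """Apply the transform learned on the training data to any feature set."""
    return (feats - mean) @ basis

# Placeholder data standing in for real acoustic features (assumption):
# 39-dimensional frames with decreasing variance across dimensions.
rng = np.random.default_rng(0)
scale = np.linspace(3.0, 0.1, 39)
adult_train = rng.normal(size=(500, 39)) * scale
child_test = rng.normal(size=(200, 39)) * scale

# The transform is estimated on the adults' data only ...
mean, basis = learn_low_rank_projection(adult_train, rank=13)

# ... and the same transform is applied to both training and test features.
train_proj = project(adult_train, mean, basis)
test_proj = project(child_test, mean, basis)
```

The point of the sketch is that the basis is estimated from the training data alone, so the test features are mapped into the subspace that captures most of the training variance.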

Full Text