Abstract

In this paper, we propose a novel front-end speech parameterization technique for automatic speech recognition (ASR) that is less sensitive to ambient noise and pitch variations. First, using variational mode decomposition (VMD), we decompose the short-time magnitude spectrum obtained by the discrete Fourier transform into several components. To suppress the ill effects of noise and pitch variations, the spectrum is then smoothed by discarding the higher-order variational mode functions and reconstructing it from the first two modes only. As a result, the smoothed spectrum closely resembles the spectral envelope. Next, Mel-frequency cepstral coefficients (MFCC) are extracted from the VMD-based smoothed spectra. The experimental evaluations presented in this study show that the proposed front-end acoustic features are more robust to ambient noise and pitch variations than conventional MFCC features. For these evaluations, we developed an ASR system using speech data from adult speakers collected under relatively clean recording conditions. State-of-the-art acoustic modeling techniques based on deep neural networks (DNN) and long short-term memory recurrent neural networks (LSTM-RNN) were employed. The ASR systems were then evaluated under noisy test conditions to assess the noise robustness of the proposed features. To assess robustness to pitch variations, further evaluations were performed on another test set consisting of speech data from child speakers. Transcribing children's speech simulates an ASR task in which the pitch differences between training and test data are large. Both the signal-domain analyses and the experimental evaluations presented in this paper support our claims.
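
The last two stages of the pipeline described above (reconstruction from the first two modes, then MFCC extraction) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the VMD modes of one frame's magnitude spectrum have already been computed by some external routine and are passed in as `modes`, a hypothetical list of equal-length lists. The function names, the common `2595 * log10(1 + f / 700)` mel formula, the 23 filters, and the 13 coefficients are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the abstract does not name one)
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def smooth_spectrum(modes, keep=2):
    """Sum the first `keep` modes bin by bin; higher-order modes are discarded."""
    return [sum(vals) for vals in zip(*modes[:keep])]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale over 0..sample_rate/2."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies, expressed as fractional bin positions
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    pos = [e / nyquist * (n_bins - 1) for e in edges]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = pos[m - 1], pos[m], pos[m + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_from_spectrum(spectrum, sample_rate, n_filters=23, n_ceps=13):
    """Log mel-filterbank energies of the power spectrum, followed by a DCT-II."""
    bank = mel_filterbank(n_filters, len(spectrum), sample_rate)
    # Floor the energies so the logarithm stays finite
    log_e = [
        math.log(max(sum(w * s * s for w, s in zip(row, spectrum)), 1e-10))
        for row in bank
    ]
    return [
        sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
            for j, e in enumerate(log_e))
        for c in range(n_ceps)
    ]


# Toy usage: three made-up "modes" over 129 frequency bins (not real VMD output)
toy_modes = [[1.0] * 129, [0.5] * 129, [0.25] * 129]
smoothed = smooth_spectrum(toy_modes, keep=2)
features = mfcc_from_spectrum(smoothed, sample_rate=16000)
```

Only the `keep=2` choice comes from the abstract. In a real front end, `modes` would come from a VMD of each frame's DFT magnitude spectrum, and the filterbank settings would follow the recognizer's configuration.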
