Abstract

We present a novel approach to derive robust speech representations for automatic speech recognition (ASR) systems. The proposed method uses an unsupervised, data-driven modulation filter learning approach that preserves the key modulations of the speech signal in the spectro-temporal domain. This is achieved with a deep generative modeling framework that learns modulation filters using a convolutional variational autoencoder (CVAE). A skip-connection based CVAE enables the learning of multiple non-redundant modulation filters in the time and frequency modulation domains using the temporal and spectral trajectories of the input spectrograms. The learnt filters are used to process the spectrogram features for ASR training. The ASR experiments are performed on the Aurora-4 (additive noise with channel artifact) and CHiME-3 (additive noise with reverberation) databases. The results show significant improvements for the proposed CVAE model over the baseline features as well as other robust front-ends (average relative improvements in word error rate over baseline features of 9% on the Aurora-4 database and 23% on the CHiME-3 database). In addition, the proposed features are highly beneficial for semi-supervised ASR training when reduced amounts of labeled training data are available (an average relative improvement of 29% over baseline features on the Aurora-4 database with 30% of the labeled training data).
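To make the architecture described above concrete, the following is a minimal sketch, not the authors' released code, of a skip-connection convolutional VAE operating on one-dimensional spectrogram trajectories, written in PyTorch. The layer widths, kernel size, patch length, and the placement of the single skip connection are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of a skip-connection convolutional VAE for learning
# modulation filters from spectrogram trajectories. All sizes are assumptions.
import torch
import torch.nn as nn


class SkipCVAE(nn.Module):
    def __init__(self, patch_len=101, latent_dim=8):
        super().__init__()
        self.patch_len = patch_len
        # Encoder: 1-D convolutions over a temporal or spectral trajectory
        # (a single slice of the input spectrogram).
        self.enc1 = nn.Sequential(nn.Conv1d(1, 16, 5, padding=2), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(16, 32, 5, padding=2), nn.ReLU())
        self.fc_mu = nn.Linear(32 * patch_len, latent_dim)
        self.fc_logvar = nn.Linear(32 * patch_len, latent_dim)
        # Decoder mirrors the encoder; the skip connection feeds the first
        # encoder activation into the final decoder layer.
        self.fc_dec = nn.Linear(latent_dim, 32 * patch_len)
        self.dec1 = nn.Sequential(nn.Conv1d(32, 16, 5, padding=2), nn.ReLU())
        self.dec2 = nn.Conv1d(32, 1, 5, padding=2)  # 16 decoder + 16 skip channels

    def forward(self, x):                      # x: (batch, 1, patch_len)
        h1 = self.enc1(x)                      # kept for the skip connection
        h2 = self.enc2(h1)
        flat = h2.flatten(1)
        mu, logvar = self.fc_mu(flat), self.fc_logvar(flat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        d = self.fc_dec(z).view(-1, 32, self.patch_len)
        d = self.dec1(d)
        d = torch.cat([d, h1], dim=1)          # skip connection from encoder
        return self.dec2(d), mu, logvar
```

Under this reading, after training with the usual VAE objective (reconstruction error plus KL divergence), the first-layer encoder kernels would play the role of the learnt modulation filters, convolved with the temporal and spectral trajectories of the spectrogram to produce the filtered features used for ASR training.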
