The performance of a typical speech recognition system degrades in the presence of extrinsic sources such as noise and recording artifacts such as reverberation. The principle of modulation filtering is to remove the spectro-temporal modulations of the speech signal that are most susceptible to noise while preserving the modulations that are key for speech recognition. While traditional approaches rely on hand-crafted modulation filters, in this paper we propose a novel method for learning modulation filters using deep variational models. Specifically, we pose filter learning as a deep unsupervised generative modeling problem in which the convolutional filters of a variational autoencoder capture the important speech modulations. The two-dimensional modulation filters, learned with the deep variational networks in the joint spectro-temporal domain, are used to process the spectrogram features for the speech recognition task. Several speech recognition experiments are performed on a set of tasks consisting of additive noise with channel artifacts (Aurora-4), reverberation (REVERB Challenge), and additive noise with reverberation (CHiME-3). In these experiments, the proposed modulation filter learning framework shows significant improvements over the baseline features as well as various other noise-robust front-ends (average relative improvements of 7.5% and 20% over the baseline features on the Aurora-4 and CHiME-3 databases, respectively). Furthermore, the proposed method is also shown to be of considerable benefit for semi-supervised automatic speech recognition applications. For example, on the Aurora-4 database we observe an average relative improvement of 25% over the baseline system using 30% labeled training data.
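The following is a minimal sketch, not the authors' implementation, of the kind of model the abstract describes: a convolutional variational autoencoder trained in an unsupervised manner on two-dimensional spectro-temporal patches of spectrogram features, whose learned encoder kernels could then serve as modulation filters applied to spectrograms before recognition. The patch size, number of filters, kernel size, and latent dimension below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    """Convolutional VAE over spectro-temporal patches (illustrative sketch)."""
    def __init__(self, n_filters=8, kernel=(5, 5), latent_dim=16, patch=(40, 40)):
        super().__init__()
        # Single convolutional encoder layer: its kernels act as 2-D
        # spectro-temporal modulation filters once training converges.
        self.enc_conv = nn.Conv2d(1, n_filters, kernel, padding="same")
        flat = n_filters * patch[0] * patch[1]
        self.fc_mu = nn.Linear(flat, latent_dim)
        self.fc_logvar = nn.Linear(flat, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, flat)
        self.dec_conv = nn.Conv2d(n_filters, 1, kernel, padding="same")
        self.n_filters, self.patch = n_filters, patch

    def encode(self, x):
        h = F.relu(self.enc_conv(x)).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        h = F.relu(self.fc_dec(z))
        h = h.view(-1, self.n_filters, *self.patch)
        return self.dec_conv(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

# Usage sketch: train on batches of spectrogram patches, then reuse the
# learned kernels (model.enc_conv.weight) to filter full spectrograms
# before the ASR front-end.
model = ConvVAE()
patches = torch.randn(32, 1, 40, 40)  # stand-in for spectrogram patches
recon, mu, logvar = model(patches)
loss = vae_loss(recon, patches, mu, logvar)
loss.backward()
```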