Abstract

We propose a novel technique to estimate the channel characteristics for robust speech recognition. The method focuses on reliable time–frequency speech patches which are highly independent of the noise condition. Combined with a root-based approximation of the logarithm in the MFCC computation, this reduces the variance caused by the noise on the spectral features, and therefore also the constrain on the acoustic model in a multi-style training setup. We show that compared to the standard mean normalization, the proposed method estimates the channel equally well under clean conditions and better under noisy conditions. When integrated in the feature extraction pipeline, we show improvements in speech recognition accuracy on noisy speech and a status quo on clean speech. Our experiments reveal that this method helps the most for generative models that need to model the complex noise variability, and less so for discriminative models, which can learn to ignore noise instead of accurately modeling it. Our approach outperforms the state of the art on the noisy Aurora4 task.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call