Abstract

Deep neural networks (DNNs) have gained remarkable success in speech recognition, partially attributed to the flexibility of DNN models in learning complex patterns of speech signals. This flexibility, however, may lead to serious over-fitting and hence severe performance degradation in adverse acoustic conditions, such as those with high ambient noise. We propose a noisy training approach to tackle this problem: by injecting moderate noise into the training data intentionally and randomly, more generalizable DNN models can be learned. This ‘noise injection’ technique, although already known to the neural computation community, has not been studied with DNNs, which involve a highly complex objective function. The experiments presented in this paper confirm that the noisy training approach works well for the DNN model and can provide substantial performance improvement for DNN-based speech recognition.

Highlights

  • A modern automatic speech recognition (ASR) system involves three components: an acoustic feature extractor to derive representative features for speech signals, an emission model to represent static properties of the speech features, and a transitional model to depict dynamic properties of speech production.

  • The dominant acoustic features in ASR are based on short-time spectral analysis, e.g., Mel frequency cepstral coefficients (MFCC).

  • The noisy training approach proposed in this paper was highly motivated by the noise injection theory, which has been known for decades in the neural computing community [31,32,33,34]. This paper employs this theory and contributes in two aspects: first, we examine the behavior of noise injection in DNN training, a more challenging task involving a huge number of parameters; second, we study mixtures of multiple noises at various levels of signal-to-noise ratio (SNR), which is beyond the conventional noise injection theory that assumes small and Gaussian-like injected noises.
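The mixing of noise at a chosen SNR mentioned in the highlights above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; `mix_at_snr` is a hypothetical helper that scales a noise signal so the clean-to-noise power ratio matches a target SNR, with the SNR drawn at random per utterance as in noisy training.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db` (in dB),
    then add it to the clean signal. Hypothetical helper for illustration."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power required to reach the requested SNR
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(target_noise_power / noise_power)
    return clean + noise_scaled

rng = np.random.default_rng(0)
# Stand-ins for a clean utterance and a noise recording (1 s at 16 kHz)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

# Pick an SNR level at random per utterance, as in noisy training
snr = rng.choice([0, 5, 10, 20])
noisy = mix_at_snr(clean, noise, snr)
```

In a real system the noise would come from recorded noise corpora (e.g., babble or car noise) rather than white Gaussian samples, and the SNR range would be chosen to bracket the expected deployment conditions.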


Summary

Introduction

A modern automatic speech recognition (ASR) system involves three components: an acoustic feature extractor to derive representative features for speech signals, an emission model to represent static properties of the speech features, and a transitional model to depict dynamic properties of speech production. The idea of noisy training is simple: by injecting noise into the input speech data when conducting DNN training, the noise patterns are expected to be learned, and the generalization capability of the resulting network is expected to be improved. Both effects may improve the robustness of DNN-based ASR systems in noisy conditions. If the training is based on clean speech only, the flexibility provided by the DNN structure is largely wasted. This is because the phone class boundaries are relatively clear with clean speech, so the abundant parameters of DNNs tend to learn the nuanced variations of phone realizations, conditioned on a particular type of channel and/or background noise. In noisy training, the noise-corrupted speech is fed into the DNN input units to conduct model training.
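The training procedure described above, where fresh noise corrupts the inputs at every pass, can be sketched with a toy classifier. This is a minimal sketch under stated assumptions, not the paper's DNN: a logistic-regression "frame classifier" on synthetic features stands in for the network, and the Gaussian perturbation with a fixed scale of 0.1 stands in for the injected acoustic noise. The key point is that a new noise realization is drawn every epoch, so the model never sees exactly the same corrupted input twice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class "frame classification" data standing in for speech features:
# 200 frames of 20-dimensional features, label determined by two dimensions
X = rng.standard_normal((200, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

W = np.zeros(20)
b = 0.0
lr = 0.1

for epoch in range(50):
    # Noise injection: corrupt the inputs with fresh random noise each epoch
    X_noisy = X + 0.1 * rng.standard_normal(X.shape)
    z = X_noisy @ W + b
    p = 1.0 / (1.0 + np.exp(-z))          # sigmoid output
    # Gradient of the cross-entropy loss w.r.t. W and b
    grad_W = X_noisy.T @ (p - y) / len(y)
    W -= lr * grad_W
    b -= lr * np.mean(p - y)
```

Although trained only on corrupted inputs, the model is evaluated on clean data; the injected noise acts as a regularizer that discourages fitting nuanced, condition-specific variations.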

Experiments
Findings
Conclusions
Full Text
