Abstract

This paper presents a new deep neural network (DNN)-based speech enhancement algorithm that integrates knowledge distilled from a traditional statistical method. Unlike other DNN-based methods, which typically train many different models on the same data and then average their predictions, or use a large number of noise types to enlarge the simulated noisy speech corpus, the proposed method neither trains a whole ensemble of models nor requires a large amount of simulated noisy speech. It first trains a discriminator network and a generator network simultaneously using adversarial learning. Then, the discriminator and generator networks are re-trained by distilling knowledge from the statistical method, an approach inspired by knowledge distillation in neural networks. Finally, the generator network is fine-tuned on real noisy speech. Experiments on the CHiME-4 dataset demonstrate that the proposed method achieves more robust performance than the compared DNN-based methods in terms of perceptual speech quality.
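The re-training stage described above combines supervision from clean speech with imitation of the statistical teacher's estimate. The paper's exact objective is not given here, so the following is only a minimal sketch of a common distillation-style loss; the blending weight `alpha` and the mean-squared-error form are assumptions, not the authors' formulation.

```python
import numpy as np

def distillation_loss(student_out, clean_target, teacher_out, alpha=0.5):
    """Blend a supervised loss against the clean-speech target with an
    imitation loss against the statistical teacher's enhanced output.

    alpha is an assumed hyperparameter weighting the two terms.
    """
    supervised = np.mean((student_out - clean_target) ** 2)
    imitation = np.mean((student_out - teacher_out) ** 2)
    return alpha * supervised + (1.0 - alpha) * imitation

# Toy magnitudes: the student sits halfway between target and teacher.
student = np.array([1.0, 1.0])
clean = np.array([0.0, 0.0])
teacher = np.array([2.0, 2.0])
loss = distillation_loss(student, clean, teacher, alpha=0.5)  # 1.0
```

With `alpha=1.0` this reduces to ordinary supervised training; lowering `alpha` shifts the student toward reproducing the teacher's behavior.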

Highlights

  • Single-channel speech enhancement has been studied for decades, yet it remains a challenging problem in numerous application systems such as automatic speech recognition (ASR), hearing aids, and hands-free mobile communication

  • One of the notable algorithms is the regression approach to speech enhancement based on deep neural networks (DNN), which is inspired by the successful introduction of DNN to acoustic modeling in ASR systems [6]

  • The corpus consists of real and simulated audio data taken from the 5k WSJ0-Corpus with four different types of noise, i.e., bus (BUS), cafe (CAF), pedestrian area (PED), and street junction (STR)


Summary

Introduction

Single-channel speech enhancement has been studied for decades, yet it remains a challenging problem in numerous application systems such as automatic speech recognition (ASR), hearing aids, and hands-free mobile communication. One of the notable algorithms is the regression approach to speech enhancement based on deep neural networks (DNN), which is inspired by the successful introduction of DNN to acoustic modeling in ASR systems [6]. Another kind of popular method uses a DNN to estimate an ideal binary mask (IBM) or a smoothed ideal ratio mask (IRM) in the frequency domain, which is derived from computational auditory scene analysis for monaural speech separation [7,8]. All these methods only enhance the speech magnitude spectrum, leaving the phase spectrum unprocessed.

