Abstract

Recently, supervised learning methods, especially deep neural network (DNN)-based methods, have shown promising performance in single-channel speech enhancement. Generally, these approaches extract acoustic features directly from the noisy speech to train a magnitude-aware target. In this paper, we propose to extract acoustic features not only from the noisy speech but also from the pre-estimated speech, noise, and phase separately, and then fuse them into a new complementary feature to obtain a more discriminative acoustic representation. In addition to learning a magnitude-aware target, we also utilize the fused feature to learn a phase-aware target, thereby further improving the accuracy of the recovered speech. We conduct extensive experiments, including performance comparisons with several typical existing methods, generalization evaluation on unseen noise, an ablation study, and subjective tests with human listeners, to demonstrate the feasibility and effectiveness of the proposed method. Experimental results show that the proposed method improves both the quality and the intelligibility of the reconstructed speech.

Highlights

  • Speech enhancement has been studied extensively as a fundamental signal processing method to recover received signals that are easily degraded under adverse noisy conditions

  • Perceptual evaluation of speech quality (PESQ) can effectively estimate speech quality; its score ranges from −0.5 to 4.5

  • The good performance of the deep neural network-based (DNN)-MP method comes from two aspects: one is the fusion of multiple features extracted from the pre-estimated speech, pre-estimated noise, and phase, rather than from the noisy speech alone; the other is the joint use of magnitude-aware and phase-aware training targets


Summary

Introduction

Speech enhancement has been studied extensively as a fundamental signal processing method to recover received signals that are easily degraded under adverse noisy conditions. The aim of speech enhancement is to recover and improve speech quality and intelligibility via different techniques and algorithms, such as unsupervised methods including spectral subtraction [1,2], Wiener filtering [3], statistical model-based estimation [4,5], the subband forward algorithm [6], the subspace method [5,7], and so on. These unsupervised methods are based on statistical signal processing and typically operate in the frequency domain. The voice activity detection (VAD) [8,9] algorithm is a simple approach to estimating and updating the noise spectrum, but its performance under non-stationary noise is unsatisfactory.
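The two classical ideas mentioned above, a VAD-gated noise-spectrum update and magnitude-domain spectral subtraction, can be sketched as follows. This is a minimal illustration, not code from the paper; the smoothing factor `alpha` and the spectral `floor` are illustrative assumptions.

```python
def update_noise(noise_mag, frame_mag, is_speech, alpha=0.9):
    # VAD-gated recursive averaging: refresh the noise spectrum
    # only during frames classified as non-speech.
    if is_speech:
        return noise_mag
    return [alpha * n + (1 - alpha) * m for n, m in zip(noise_mag, frame_mag)]

def spectral_subtract(noisy_mag, noise_mag, floor=0.002):
    # Subtract the estimated noise magnitude per frequency bin,
    # clamping to a small spectral floor to avoid negative magnitudes.
    return [max(s - n, floor * s) for s, n in zip(noisy_mag, noise_mag)]
```

Because the noise estimate is frozen during speech, a noise type that changes while speech is active is tracked poorly, which illustrates why such methods struggle with non-stationary noise.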

