Abstract
Estimating time-frequency domain masks for speech enhancement using deep learning approaches has recently become a popular research topic. In this paper, we propose a novel components loss (CL) for the training of neural networks for speech enhancement. During the training process, the proposed CL offers separate control over suppression of the noise component and preservation of the speech component. We illustrate the potential of the proposed CL using the example of a convolutional neural network (CNN) for mask-based speech enhancement. We show improvement in almost all employed instrumental quality metrics over the baseline losses, which comprise the conventional mean squared error (MSE) loss and the perceptual evaluation of speech quality (PESQ) loss. On average, we obtain an SNR improvement that is more than 0.3 dB higher and a PESQ score on the speech component that is at least 0.1 points higher. In addition, the residual noise sounds more natural, and the PESQ score on the enhanced speech is consistently the best. All improvements are more pronounced under low-SNR conditions.
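The idea of separate control over the two components can be illustrated with a minimal sketch. It assumes the same mask is applied to the speech and noise components individually and that the two resulting terms are combined with a weight `alpha`. The abstract does not state the formula, so the function name `components_loss`, the weight `alpha`, and the squared-error form of both terms are assumptions made for illustration, not necessarily the exact loss proposed in the paper.

```python
def components_loss(mask, speech, noise, alpha=0.5):
    """Illustrative two-term loss on flat sequences of spectral magnitudes.

    Assumed form (hypothetical, not taken from the abstract):
      speech term: mean squared error between masked and clean speech
      noise term:  mean power of the masked (residual) noise
    alpha in [0, 1] trades noise suppression against speech preservation.
    """
    if not (len(mask) == len(speech) == len(noise)) or len(mask) == 0:
        raise ValueError("mask, speech and noise must be non-empty and equal in length")

    # Apply the same mask to each component separately.
    filtered_speech = [m * s for m, s in zip(mask, speech)]
    filtered_noise = [m * n for m, n in zip(mask, noise)]

    # Speech preservation: the masked speech should stay close to the clean speech.
    speech_term = sum((fs - s) ** 2 for fs, s in zip(filtered_speech, speech)) / len(speech)

    # Noise suppression: the masked noise should carry as little power as possible.
    noise_term = sum(fn ** 2 for fn in filtered_noise) / len(noise)

    return (1.0 - alpha) * speech_term + alpha * noise_term
```

In this sketch, an all-ones mask leaves the speech untouched, so only the noise term contributes, while an all-zeros mask removes the noise completely at the cost of the full speech term. Raising `alpha` pushes training toward stronger noise suppression, and lowering it favors speech preservation.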