Abstract

In many deep learning based speech enhancement frameworks, the deep neural network (DNN) is usually trained under the mean square error (MSE) criterion to estimate the underlying clean amplitudes. However, the MSE criterion is not directly connected with speech perception, which motivates alternative training targets for further improvements in speech quality and intelligibility. This paper proposes maximizing the correlation between the enhanced speech and the reference, so as to improve the quality and the intelligibility of the enhanced speech simultaneously. Based on the short-time Fourier transform (STFT) amplitude spectra, correlations are first calculated between the envelopes along the time axis and along the frequency axis, respectively. The time and frequency envelope correlations are then combined into the time-frequency correlation (TFC). Finally, the loss function is defined as 1 minus the TFC. In the experiments, several loss functions containing the time and frequency envelope correlations are investigated. The system trained with the TFC loss outperforms the baseline trained with the MSE loss, with improvements in both speech quality and intelligibility. Phoneme recognition results also confirm the effectiveness of maximizing the correlation for speech enhancement.
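The construction described above (envelope correlations along time and along frequency, combined into a TFC, with the loss defined as 1 minus the TFC) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so several choices here are assumptions for illustration only: Pearson correlation is used as the correlation measure, the per-bin and per-frame correlations are averaged, and the two terms are mixed with a hypothetical `weight` parameter. The function names `envelope_correlation` and `tfc_loss` are likewise hypothetical.

```python
import numpy as np


def envelope_correlation(enhanced, reference, axis, eps=1e-8):
    """Pearson correlation between two arrays along `axis` (assumed measure)."""
    e = enhanced - enhanced.mean(axis=axis, keepdims=True)
    r = reference - reference.mean(axis=axis, keepdims=True)
    num = (e * r).sum(axis=axis)
    den = np.sqrt((e * e).sum(axis=axis) * (r * r).sum(axis=axis)) + eps
    return num / den


def tfc_loss(enhanced, reference, weight=0.5):
    """Sketch of a 1 - TFC loss on amplitude spectra of shape (frames, bins).

    Assumption: the TFC is a weighted average of the mean time-envelope
    correlation and the mean frequency-envelope correlation.
    """
    # Time envelopes: one trajectory over frames for each frequency bin.
    time_corr = envelope_correlation(enhanced, reference, axis=0).mean()
    # Frequency envelopes: one spectrum over bins for each frame.
    freq_corr = envelope_correlation(enhanced, reference, axis=1).mean()
    tfc = weight * time_corr + (1.0 - weight) * freq_corr
    return 1.0 - tfc


# Usage with synthetic, non-negative "amplitude spectra".
rng = np.random.default_rng(0)
reference = rng.random((50, 40))
noisy = reference + 0.5 * rng.random((50, 40))
print(tfc_loss(reference, reference))  # close to 0 for a perfect match
print(tfc_loss(noisy, reference))      # larger when the estimate is degraded
```

Under these assumptions the loss is insensitive to a global gain or offset applied to the enhanced spectra, which is one way a correlation-based criterion differs from the MSE. For actual DNN training, the same arithmetic would need to be written in an automatic-differentiation framework.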
