Modulation domain processing and speech phase spectrum in speech enhancement

Yi Zhang

doi:10.32469/10355/33117

Abstract

In real-world scenarios, a desired speech signal is often accompanied by various kinds of interference, such as background noise, reverberation, and competing speech. Such interference not only degrades the perceptual quality and intelligibility of speech, causing listening fatigue, but also hampers speech technology applications such as automatic speech recognition, speaker recognition, and hearing aid systems. Recovering clean speech from corrupted recordings has therefore been an active area of research and development in academia and industry. Speech enhancement, which aims to improve the quality of the target speech in the presence of interference, includes the topics of noise reduction, speech dereverberation, and blind speech separation. The goal of noise reduction is to suppress background noise while keeping the speech signal as free from processing distortion as possible. Because they are convenient to implement, single-channel noise reduction algorithms are often used. Classical single-channel noise reduction methods include spectral subtraction, Wiener filtering, and minimum mean square error estimation. Reverberant speech is produced by convolving a clean speech signal with the impulse response of the sound propagation path in a reverberant room, so one enhancement solution is to find an inverse filter that reverses the convolution. If late reverberation is treated as additive noise, another possible solution comes from noise reduction algorithms. Blind speech separation aims to separate the speech signals of different sources based only on the recorded convolutive mixtures of multiple speech signals.
According to the number of microphones, blind speech separation (BSS) methods can be divided into overdetermined or critically determined methods, underdetermined methods, and single-channel methods. In overdetermined or critically determined methods, the number of sensors is greater than or equal to the number of sources; in underdetermined methods, the number of sensors is less than the number of sources; and in single-channel methods, only one sensor is used for the separation task, which makes them a special case of the underdetermined methods.

In this work, we propose a novel spectral subtraction method for noisy speech enhancement. Instead of taking the conventional approach of subtracting on the magnitude spectrum in the acoustic frequency domain, we perform subtraction on the real and imaginary spectra separately in the modulation frequency domain; we refer to this method as MRISS. By doing so, we are able to enhance the phase as well as the magnitude through spectral subtraction. We conducted objective and subjective evaluation experiments to compare the proposed MRISS method against three existing methods: modulation frequency domain magnitude spectral subtraction, nonlinear spectral subtraction, and minimum mean square error estimation. The objective evaluation used segmental signal-to-noise ratio (SNR), PESQ, and average Itakura-Saito spectral distance as criteria. The subjective evaluation used a mean preference score from 14 participants. Both the objective and subjective results show that the proposed method outperformed the three existing speech enhancement methods. Further analysis shows that the advantage of the proposed MRISS method comes from improved recovery of both the acoustic magnitude and phase spectra.

We also investigate applying the MRISS algorithm to the speech dereverberation task.
Instead of estimating the background noise, we estimate the late reverberation spectral variance directly from the observed reverberant speech and subtract it from the reverberant speech. Our experimental results show that the proposed method outperformed the state-of-the-art single-channel multi-step linear prediction method in terms of PESQ and segmental SNR.

We investigate direction-of-arrival (DOA) based blind speech separation under challenging conditions, e.g., close source directions, unbalanced source energies, reverberation, and background noise. We propose using ALMM to fit the subband IPD data to improve DOA estimation, and prove that ALMM fits the asymmetric IPD data distribution better than the conventional GMM and LMM, especially when the source directions are close. We also propose a log-likelihood criterion to estimate the number of sources. By forming a sequence of negated log-likelihood scores of the mixture model and the corresponding component models, where each score targets one source-number hypothesis, we determine the number of sources by minimizing the negated log-likelihood score. The proposed method obtained large improvements over the AIC and BIC methods when source directions are close.
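
The difference between subtracting on the magnitude and subtracting on the real and imaginary parts can be illustrated on a single complex spectral coefficient. The sketch below is a simplified illustration only and is not the MRISS algorithm itself: it omits the modulation-domain transform, and the function names, the zero floor, and the noise estimates are all assumptions chosen for the example. It shows that magnitude subtraction leaves the noisy phase untouched, whereas per-component subtraction generally changes the phase as well.

```python
import cmath
import math


def magnitude_subtract(y: complex, noise_mag: float) -> complex:
    """Subtract a noise magnitude estimate and reuse the noisy phase (hypothetical helper)."""
    mag = max(abs(y) - noise_mag, 0.0)  # floor at zero (assumed flooring rule)
    return cmath.rect(mag, cmath.phase(y))


def real_imag_subtract(y: complex, noise_re: float, noise_im: float) -> complex:
    """Subtract noise magnitude estimates from the real and imaginary parts separately.

    Each part keeps its original sign; its absolute value is reduced and
    floored at zero (assumed rule, for illustration only).
    """
    re = math.copysign(max(abs(y.real) - noise_re, 0.0), y.real)
    im = math.copysign(max(abs(y.imag) - noise_im, 0.0), y.imag)
    return complex(re, im)


# One noisy coefficient with made-up noise estimates.
y = complex(3.0, 1.0)
a = magnitude_subtract(y, 1.0)
b = real_imag_subtract(y, 0.5, 0.8)

print("noisy phase:              ", cmath.phase(y))
print("after magnitude subtract: ", cmath.phase(a))
print("after real/imag subtract: ", cmath.phase(b))
```

In this toy case the phase changes only because the two components are attenuated by different amounts; how the noise estimates for each component are obtained is outside the scope of the sketch.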
