Improved Speech Enhancement Considering Speech PSD Uncertainty

Minseung Kim, Jong Won Shin

doi:10.1109/taslp.2022.3180676

Abstract

Speech enhancement based on statistical models has been studied for several decades. Recently, a speech enhancement approach adopting a speech power spectral density (PSD) uncertainty model was proposed. This approach distinguishes the true speech PSD from its estimate and treats both as random variables. It incorporates a prior distribution of speech spectra and speech PSD estimators to derive PSD uncertainty-aware counterparts to conventional clean speech estimators, which results in improved performance. However, the speech PSD uncertainty model has not yet been adopted for estimating the other parameters in the speech enhancement framework, such as the a posteriori speech presence probability (SPP), the noise PSD, and the speech power spectrum. In this paper, we incorporate the speech PSD uncertainty model into all the components of the statistical model-based speech enhancement framework by deriving PSD uncertainty-aware counterparts to conventional parameter estimators. Specifically, we derive the a posteriori SPP in which the likelihood function for each hypothesis is based on the speech PSD uncertainty. With this a posteriori SPP, a novel SPP-based noise PSD estimator is derived. We also derive the minimum mean-square error (MMSE) estimator for the power spectrum of the clean speech in the current frame under speech PSD uncertainty, which is exploited to refine the speech PSD estimator. Finally, the refined speech PSD estimator is incorporated into the spectral gain function based on the speech PSD uncertainty model.
In our experiments, the proposed approach showed improved noise PSD estimation performance in terms of the averaged logarithmic error distance, and improved speech enhancement performance in terms of noise reduction, segmental signal-to-noise ratio, perceptual evaluation of speech quality (PESQ) scores, and short-time objective intelligibility. It also exhibited performance comparable to that of a real-time deep learning-based speech enhancement system in terms of PESQ scores and composite measures on the VoiceBank-DEMAND dataset.

Full Text