Abstract
Speech enhancement based on statistical models has been studied for several decades. Recently, a speech enhancement approach that adopts a speech power spectral density (PSD) uncertainty model has been proposed. This approach distinguishes the true speech PSD from its estimate and treats both as random variables. It incorporates a prior distribution of speech spectra and speech PSD estimators to derive PSD uncertainty-aware counterparts to conventional clean speech estimators, which improves performance. However, the speech PSD uncertainty model has not yet been applied to the estimation of the other parameters in the speech enhancement framework, such as the a posteriori speech presence probability (SPP), the noise PSD, and the speech power spectrum. In this paper, we incorporate the speech PSD uncertainty model into all the components of the statistical model-based speech enhancement framework by deriving PSD uncertainty-aware counterparts to conventional parameter estimators. Specifically, we derive the a posteriori SPP in which the likelihood function for each hypothesis is based on the speech PSD uncertainty. With this a posteriori SPP, a novel SPP-based noise PSD estimator is derived. We also derive the minimum mean-square error (MMSE) estimator for the power spectrum of the clean speech in the current frame under speech PSD uncertainty, which is exploited to refine the speech PSD estimator. Finally, the refined speech PSD estimator is incorporated into the spectral gain function based on the speech PSD uncertainty model.
In our experiments, the proposed approach improved noise PSD estimation in terms of the averaged logarithmic error distance, and improved speech enhancement in terms of noise reduction, segmental signal-to-noise ratio, perceptual evaluation of speech quality (PESQ) scores, and short-time objective intelligibility. It also performed comparably to a real-time deep learning-based speech enhancement system in terms of PESQ scores and composite measures on the VoiceBank-DEMAND dataset.
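For orientation, the kind of conventional SPP-based noise PSD tracker that the abstract refers to as a baseline can be sketched for a single time-frequency bin as follows. This is not the PSD uncertainty-aware estimator derived in the paper. It assumes the standard complex Gaussian model with a fixed a priori SNR under speech presence, and the function names and parameter values (15 dB fixed a priori SNR, equal priors, smoothing constant 0.8) are illustrative assumptions rather than settings taken from the paper.

```python
import math


def a_posteriori_spp(noisy_power, noise_psd, xi_h1=10 ** (15 / 10), prior_h1=0.5):
    """Conventional a posteriori SPP for one time-frequency bin.

    Assumes complex Gaussian speech and noise coefficients and a fixed
    a priori SNR xi_h1 under the speech-presence hypothesis (assumed 15 dB).
    noise_psd must be positive.
    """
    prior_ratio = (1.0 - prior_h1) / prior_h1
    exponent = -(noisy_power / noise_psd) * xi_h1 / (1.0 + xi_h1)
    return 1.0 / (1.0 + prior_ratio * (1.0 + xi_h1) * math.exp(exponent))


def update_noise_psd(noisy_power, noise_psd_prev, alpha=0.8):
    """One recursive noise PSD update driven by the SPP (alpha is assumed)."""
    spp = a_posteriori_spp(noisy_power, noise_psd_prev)
    # MMSE-style estimate of the noise periodogram: trust the observation
    # when speech is likely absent, otherwise keep the previous estimate.
    noise_periodogram = (1.0 - spp) * noisy_power + spp * noise_psd_prev
    noise_psd = alpha * noise_psd_prev + (1.0 - alpha) * noise_periodogram
    return noise_psd, spp
```

In this sketch the true speech PSD never appears as a random variable; the fixed `xi_h1` stands in for it. The paper's contribution, as described above, is to replace the likelihoods in such an estimator with ones based on the speech PSD uncertainty model.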
More From: IEEE/ACM Transactions on Audio, Speech, and Language Processing