Abstract
This paper proposes an online single-channel speech enhancement method designed to improve the quality of speech degraded by reverberation and noise. Based on an autoregressive model for the reverberation power and on a hidden Markov model for clean speech production, a Bayesian filtering formulation of the problem is derived, and online joint estimation of the acoustic parameters and the mean speech, reverberation, and noise powers is obtained in mel-frequency bands. From these estimates, a real-valued spectral gain is derived and spectral enhancement is applied in the short-time Fourier transform (STFT) domain. The method yields state-of-the-art performance and greatly reduces the effects of reverberation and noise while improving speech quality and preserving speech intelligibility in challenging acoustic environments.
Highlights
Speech signals captured using a distant microphone within a confined acoustic space are often corrupted by reverberation
The log-power, S_l, of the level-normalized clean speech is modeled by a hidden Markov model (HMM) with N states, in which the state at time frame l is denoted by c_l
Six metrics were used to evaluate the algorithms: the Cepstrum Distance (CD) [53], the Frequency-weighted Segmental SNR (FWSegSNR) [54], the Reverberation Decay Tail (RDT) [55], the normalized Speech-to-Reverberation Modulation energy Ratio (SRMR_norm) [56], the Short-Time Objective Intelligibility score (STOI) [58] and the Perceptual Evaluation of Speech Quality (PESQ) [60]
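To make the HMM highlight more concrete, the following is a minimal sketch of one forward-filtering step over the speech states, given a log-power observation for a single frame. It is illustrative only and is not the paper's estimator: it assumes a single frequency band with a scalar log-power, a Gaussian emission density with a shared variance, and a known transition matrix. The function name `forward_step` and all of its parameters are hypothetical.

```python
import math
from typing import List, Sequence


def forward_step(
    prior: Sequence[float],
    transition: Sequence[Sequence[float]],
    state_means: Sequence[float],
    state_var: float,
    observed_log_power: float,
) -> List[float]:
    """One HMM forward-filtering step: predict the state, then update with the observation.

    Assumptions (not taken from the paper): scalar log-power, Gaussian
    emissions with a shared variance, and transition[i][j] = P(next=j | current=i).
    """
    n = len(prior)
    # Prediction: propagate the previous state posterior through the transition matrix.
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Update: weight each state by the Gaussian likelihood of the observed log-power.
    norm = math.sqrt(2.0 * math.pi * state_var)
    likelihood = [
        math.exp(-0.5 * (observed_log_power - m) ** 2 / state_var) / norm
        for m in state_means
    ]
    unnormalized = [p * lk for p, lk in zip(predicted, likelihood)]
    total = sum(unnormalized)
    if total == 0.0:
        # Observation is implausible under every state; fall back to the prediction.
        return predicted
    return [u / total for u in unnormalized]


# Usage: two hypothetical states, a "quiet" one at -10 and a "loud" one at 0 (log-power units).
posterior = forward_step(
    prior=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    state_means=[-10.0, 0.0],
    state_var=4.0,
    observed_log_power=0.0,
)
```

In this toy setting the posterior concentrates on the second state, because the observation matches its mean. A practical system would work on a vector of mel-band log-powers and would also have to account for reverberation and noise in the observation.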
Summary
Speech signals captured using a distant microphone within a confined acoustic space are often corrupted by reverberation. A time-frequency gain is applied to the noisy reverberant spectral coefficients in order to estimate those of the clean speech. This gain is based on the estimated power spectral densities (PSDs) of the noise and late reverberation [6], [13]. The idea of using an HMM whose states represent broad speech sound classes with distinct acoustic spectra has been applied previously to speech enhancement [27]–[30]. In these papers, a state-dependent spectral shape was multiplied by a time-varying speech gain to obtain prior distributions for the speech spectral coefficients; these priors were used to determine an MMSE or MAP estimate of the clean speech spectrum in an appropriate domain.
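As an illustration of the time-frequency gain described above, the sketch below computes a generic Wiener-type gain from per-bin PSD estimates and applies it to the complex STFT coefficients of one frame. This is a common textbook form and is not necessarily the gain derived in the paper. The function names, the `gain_floor` parameter, and its default value are assumptions made for this example.

```python
from typing import List, Sequence


def spectral_gain(
    speech_psd: Sequence[float],
    reverb_psd: Sequence[float],
    noise_psd: Sequence[float],
    gain_floor: float = 0.1,
) -> List[float]:
    """Real-valued Wiener-type gain per frequency bin: speech / (speech + reverb + noise).

    gain_floor (a hypothetical parameter) limits the maximum attenuation,
    a common way to reduce musical-noise artifacts.
    """
    gains = []
    for s, r, n in zip(speech_psd, reverb_psd, noise_psd):
        total = s + r + n
        gain = s / total if total > 0.0 else gain_floor
        gains.append(max(gain, gain_floor))
    return gains


def apply_gain(coeffs: Sequence[complex], gains: Sequence[float]) -> List[complex]:
    """Scale each noisy reverberant STFT coefficient by its gain; the phase is unchanged."""
    return [g * y for g, y in zip(gains, coeffs)]


# Usage: one frame with two frequency bins and made-up PSD estimates.
gains = spectral_gain(speech_psd=[1.0, 0.0], reverb_psd=[0.5, 1.0], noise_psd=[0.5, 1.0])
enhanced = apply_gain([2 + 2j, 1j], gains)
```

Because the gain is real-valued, only the magnitude of each coefficient changes and the noisy phase is reused. The second bin, where the speech PSD estimate is zero, is attenuated down to the floor instead of being zeroed out.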