Abstract

We present on-line Gaussian mixture modeling (GMM) in the log-power domain of actual noisy speech and its applications to segmental signal-to-noise ratio (SNR) estimation and speech enhancement. The basic idea is to fit a conventional two-component GMM in the log-power domain to estimate the distributions of the noise and noisy-speech subspaces in each speech segment of 0.5–2 s. Given the subspace distributions, statistical estimation is applied in both applications. For segmental SNR estimation, the average speech level is estimated from the noisy speech using a nonlinear moment of the modeled distributions. This method is suitable for real conditions, when neither reference signals nor speech activity information is available, and is shown to be more robust and accurate than conventional methods, particularly under low-SNR conditions. The proposed GMM is extended to multiband log-power domains for noise estimation. We use long-term information, obtained by fitting a GMM in each 0.5 s segment, to update the local distributions of noise and noisy speech power at each time–frequency index. Cumulative distribution function equalization (CDFE) is then used to estimate the noise, which is subtracted from the noisy speech power. The advantage of the CDFE method for noise estimation is that the estimate is obtained in the logarithmic domain without any approximation. The proposed speech enhancement is tested using the AURORA-2J database. We also compare the proposed method with conventional minimum statistics and quantile-based noise estimation. The proposed method is found to be superior to the conventional methods in speech recognition rate in most noise environments and is shown to provide a very good compromise between speech enhancement and speech recognition performance.
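The two-component idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes natural-log frame powers as input, fits a 1-D two-component GMM by plain EM, treats the lower-mean component as noise and the higher-mean component as noisy speech, and converts each component to an average power with the log-normal mean exp(mu + var/2). That moment is an assumption chosen for illustration; the abstract does not specify the nonlinear moment used. The function names `fit_two_component_gmm` and `estimate_segmental_snr_db` are hypothetical.

```python
import numpy as np


def fit_two_component_gmm(x, n_iter=100):
    """Fit a 1-D two-component Gaussian mixture by EM.

    Returns (weights, means, variances), sorted by ascending mean.
    """
    x = np.asarray(x, dtype=float)
    # Initialise the means from the lower and upper quartiles of the data.
    mu = np.percentile(x, [25.0, 75.0])
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities computed from log densities for stability.
        log_p = (np.log(w)
                 - 0.5 * np.log(2.0 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        w = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    order = np.argsort(mu)
    return w[order], mu[order], var[order]


def estimate_segmental_snr_db(log_power):
    """Estimate the SNR (dB) of one segment from natural-log frame powers.

    Assumption (not from the paper): the lower-mean component models noise,
    the higher-mean component models noisy speech, and each average power
    is the log-normal mean exp(mu + var / 2).
    """
    _, mu, var = fit_two_component_gmm(log_power)
    noise_power = np.exp(mu[0] + 0.5 * var[0])
    noisy_power = np.exp(mu[1] + 0.5 * var[1])
    # Clamp so the logarithm stays defined if the components overlap heavily.
    speech_power = max(noisy_power - noise_power, 1e-12)
    return 10.0 * np.log10(speech_power / noise_power)
```

In use, `log_power` would hold the log of the mean squared amplitude of each short frame within one 0.5–2 s segment, and the estimate would be recomputed segment by segment.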
