Abstract
Speech enhancement methods formulated in the short-time Fourier transform (STFT) domain vary in the statistical assumptions made on the STFT coefficients, in the optimization criteria applied, and in the models of the signal components. Recently, approaches relying on a stochastic-deterministic speech model have been proposed. The deterministic part of the signal corresponds to harmonically related sinusoids, often used to represent voiced speech. The stochastic part models signal components that are not captured by the deterministic components. In this paper, we consider this scenario from a new perspective, yielding three main contributions. First, a pitch-synchronous signal representation is considered and shown to be advantageous for the estimation of the harmonic model parameters. Second, we model the harmonic amplitudes in voiced speech as random variables with frequency-bin-dependent Gamma distributions. Finally, distinct estimators for the different models of voiced speech, unvoiced speech, and speech absence are derived. To select among the resulting estimates, we take into account the mutual impact of detection and estimation by proposing a binary decision framework that is derived from a Bayesian risk function. The resulting pitch-synchronous stochastic-deterministic estimator outperforms several benchmark methods in terms of speech intelligibility and perceived quality, as predicted by instrumental measures, for various noise types and signal-to-noise ratios.