We have developed a new model to predict speech intelligibility of synthetic sounds processed by nonlinear speech enhancement algorithms. The model combines two recent auditory models: the dynamic compressive gammachirp (dcGC) auditory filterbank and the speech envelope power spectrum model (sEPSM). We compared the dcGC-sEPSM with commonly used prediction models against perceptual intelligibility scores for speech sounds enhanced by classic spectral subtraction and state-of-the-art Wiener filtering. The dcGC-sEPSM predicted the scores better than the coherence speech intelligibility index (CSII), the short-time objective intelligibility (STOI), and the original sEPSM using the gammatone filterbank. However, the predictions were still not fully consistent with the data. In this work, we analyze the acoustic features used in these prediction models. The CSII calculates the magnitude-squared coherence between clean and processed spectra to derive a signal-to-distortion ratio. The STOI calculates the correlation coefficients between the short-time frame vectors of clean and degraded sounds at the output of a one-third octave filterbank. The sEPSM calculates the signal-to-noise ratio in the envelope modulation domain at the output of the auditory filterbank. We summarize the methods and discuss desirable features for improving speech intelligibility predictions.
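To make the STOI-style feature concrete, the following is a minimal sketch of the correlation between short-time frame vectors of a clean and a degraded envelope in a single band. It is an illustration under stated assumptions, not the STOI implementation: the function names `pearson` and `mean_frame_correlation` and the default `frame_len=30` are hypothetical choices made here, the inputs are assumed to be band envelopes already extracted by some filterbank, and the one-third octave analysis and any normalization or clipping applied by the actual measure are omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        # A constant frame has no variance; treat it as uncorrelated (assumption).
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def mean_frame_correlation(clean_env, degraded_env, frame_len=30):
    """Average the correlation over all sliding frames of one band envelope.

    clean_env and degraded_env are assumed to be time-aligned envelope samples
    of the same band; frame_len is a hypothetical frame length in samples.
    """
    n = min(len(clean_env), len(degraded_env))
    if n < frame_len:
        raise ValueError("envelopes are shorter than one frame")
    scores = [
        pearson(clean_env[i:i + frame_len], degraded_env[i:i + frame_len])
        for i in range(n - frame_len + 1)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Synthetic envelope used only for demonstration.
    clean = [math.sin(0.3 * i) + 1.0 for i in range(100)]
    degraded = [0.5 * v + 0.2 * math.cos(1.7 * i) for i, v in enumerate(clean)]
    print(mean_frame_correlation(clean, degraded))
```

Because the correlation coefficient is invariant to gain and offset, a degraded envelope that is only a scaled and shifted copy of the clean one scores the same as the clean envelope itself, so this feature on its own responds to changes in envelope shape rather than level.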