Abstract

Conventional non-intrusive speech intelligibility estimation extracts features from the time-frequency representation of the input through explicit filter bank processing or spectral masking. However, these filter banks and masking processes are not always optimal. We replaced them with convolutional neural networks (CNNs) that use rectangular kernels restricted to the frequency direction, together with masking based on mechanisms such as self-attention. We expect this to yield features that are optimal for intelligibility estimation and to give accurate estimates that generalize well to input under various conditions. We further applied this CNN front-end to a previously proposed prediction model that uses speech enhancement. Estimation accuracy improved over conventional front-ends based on fixed filter banks: the correlation coefficient between the predictions and the subjective evaluation was 0.84, compared with 0.80 for the fixed filter bank.
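To make the front-end idea concrete, the following is a minimal NumPy sketch of a convolution whose kernel spans only the frequency axis, followed by a self-attention-style mask. It is an illustration under stated assumptions, not the model from the paper: the function names `freq_conv` and `self_attention_mask`, the single averaging kernel, the identity query/key/value projections, and the sigmoid used to form the mask are all hypothetical choices, and nothing here is trained.

```python
import numpy as np


def freq_conv(spec, kernel):
    """Filter a spectrogram along the frequency axis only.

    spec:   (n_freq, n_frames) time-frequency representation
    kernel: (k,) weights covering k frequency bins and a single frame
    Returns an array of shape (n_freq - k + 1, n_frames).
    """
    k = kernel.shape[0]
    # Sliding windows over frequency: shape (n_freq - k + 1, n_frames, k)
    windows = np.lib.stride_tricks.sliding_window_view(spec, k, axis=0)
    # Cross-correlation, as CNN layers compute it (no kernel flip)
    return windows @ kernel


def self_attention_mask(feat):
    """Build a mask in (0, 1) from scaled dot-product self-attention over frames.

    Assumption: queries, keys, and values are the features themselves
    (identity projections); a trained model would learn these projections.
    """
    x = feat.T                                    # (n_frames, n_feat)
    scores = x @ x.T / np.sqrt(x.shape[1])        # frame-to-frame similarity
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over frames
    context = weights @ x                         # attention-weighted features
    return (1.0 / (1.0 + np.exp(-context))).T     # sigmoid, back to (n_feat, n_frames)


# Hypothetical input: 40 frequency bins, 10 frames of random "spectrogram" values
spec = np.random.default_rng(0).random((40, 10))
feat = freq_conv(spec, np.ones(5) / 5)            # 5-bin averaging kernel
masked = feat * self_attention_mask(feat)         # element-wise masking
```

In a trained system the kernel weights and attention projections would be learned jointly with the intelligibility predictor, which is what would let the features adapt to the task instead of being fixed in advance.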
