Abstract

In this article, a lightweight and interpretable speech intelligibility prediction network is proposed. It is based on the ESTOI metric with several extensions: learned modulation filterbank, temporal attention, and taking into account robustness of a given reference recording. The proposed network is differentiable, and therefore it can be applied as a loss function in speech enhancement systems. The method was evaluated using the Clarity Prediction Challenge dataset. Compared to MB-STOI, the best of the systems proposed in this paper reduced RMSE from 28.01 to 21.33. It also outperformed best performing systems from the Clarity Challenge, while its training does not require additional labels like speech enhancement system and talker. It also has small memory and requirements, therefore, it can be potentially used as a loss function to train speech enhancement system. As it would consume less resources, the saved ones can be used for a larger speech enhancement neural network.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call