Abstract

Computational objective metrics that use reference signals have been shown to be effective forms of speech assessment in simulated environments, since they correlate well with subjective listening studies. Recent efforts have been dedicated to effective forms of reference-less assessment to make real-world assessment more practical, but these approaches predict only a limited number of assessment measures and have not been evaluated in real-world conditions. In this work, we present a novel reference-less framework, the attention-enhanced multi-task speech assessment (AMSA) model, which provides reliable estimates of multiple objective quality and intelligibility measures in simulated and real-world environments. The multi-task learning (MTL) architecture effectively generates discriminative features that improve our model’s robustness. An attention mechanism is employed to identify key features within the feature space, and it noticeably reduces the estimation errors. A classification-aided module is also included to further suppress prediction outliers. Our model achieves state-of-the-art performance in simulated and real-world data environments, where the results are strongly correlated with the corresponding reference-based objective scores.
