Abstract
Speech quality is often measured via subjective testing, or with objective estimators of mean opinion score (MOS) such as ViSQOL or POLQA. Typical MOS-estimation frameworks use signal-level features but do not use language features, which have been shown to affect opinion scores. If there is a conditional dependence between score and language given these signal features, introducing language and rater predictors should provide a marginal improvement in predictions. The proposed method uses Bayesian models that predict the individual opinion score instead of MOS. Several models that test various combinations of predictors were used, including predictors that capture signal features, such as frequency band similarity, as well as features related to the listener, such as language and rater indices. The models are fit to the ITU-T P. Supplement 23 dataset, and posterior samples are drawn from distributions of both the model parameters and the resulting opinion score outcomes. These models are compared to MOS models by integrating over posterior samples per utterance. An experiment was conducted by ablating different predictors for several types of Bayesian hierarchical models (including ordered logistic and truncated normal individual score distributions, as well as MOS distributions) to find the marginal improvement from language and rater. The models that included language and/or rater obtained significantly lower errors (0.601 versus 0.684 root-mean-square error (RMSE)) and higher correlation. Additionally, individual rater models matched or exceeded the performance of MOS models.
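As a rough illustration of how individual-score models can be reduced to MOS-level predictions for comparison, the sketch below averages posterior draws of predicted scores per utterance and evaluates them against observed MOS with RMSE and Pearson correlation. The arrays here are random placeholders, not the paper's data or results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder posterior draws of predicted individual scores:
# one row per posterior sample, one column per utterance.
posterior_scores = rng.integers(1, 6, size=(4000, 200))
observed_mos = rng.uniform(1, 5, size=200)  # placeholder observed MOS per utterance

# Integrate over posterior samples per utterance: the posterior mean of the
# predicted individual scores serves as the model's MOS estimate.
predicted_mos = posterior_scores.mean(axis=0)

# Compare against observed MOS with RMSE and Pearson correlation.
rmse = np.sqrt(np.mean((predicted_mos - observed_mos) ** 2))
corr = np.corrcoef(predicted_mos, observed_mos)[0, 1]
print(f"RMSE = {rmse:.3f}, Pearson r = {corr:.3f}")
```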
Highlights
Measuring and estimating speech quality is an important task for many fields
We propose to use a Bayesian hierarchical model of an ordered categorical distribution to model individual opinion scores based on speech, listener, and language features
When subjective tests are performed, it is common to see the results reported with the sample mean μ specified along with a 95% or 99% symmetric confidence interval [μ − cσ, μ + cσ], where σ is the sample standard deviation of the opinion score and c is a constant
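A minimal sketch of the reporting convention described in the last highlight, using placeholder opinion scores for one condition and a placeholder value of the constant c:

```python
import numpy as np

scores = np.array([4, 3, 5, 4, 4, 2, 5, 3])  # placeholder opinion scores for one condition
c = 1.96                                     # placeholder constant chosen by the experimenter

mu = scores.mean()           # sample mean, i.e. the MOS
sigma = scores.std(ddof=1)   # sample standard deviation of the opinion scores
low, high = mu - c * sigma, mu + c * sigma
print(f"MOS = {mu:.2f}, interval = [{low:.2f}, {high:.2f}]")
```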
Summary
Measuring and estimating speech quality is an important task for many fields. In speech synthesis and coding [1], subjective measurements of quality can be used to validate novel designs, and may be especially useful when traditional objective metrics like SNR diverge from human perception. The absolute category rating (ACR) test asks raters to measure the quality of speech utterances under various test conditions by assigning a score from 1 (bad) to 5 (excellent), with recommendations for conducting the test given in ITU-T P.800 [2]. The mean opinion score (MOS) can be calculated by aggregating the scores over each utterance or over all the utterances within a given condition. MOS is a standard measurement that is used in research and development of many speech applications such as codecs and speech enhancement [3].
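To make the modeling idea from the abstract and highlights concrete, the sketch below writes out an ordered-logistic observation model over the five ACR categories, where the latent quality combines a signal-level feature with per-language and per-rater offsets. All parameter values are hypothetical stand-ins for quantities the paper infers with Bayesian sampling; this is an illustration of the likelihood structure, not the fitted model.

```python
import numpy as np

def ordered_logistic_pmf(eta, cutpoints):
    """P(score = k) for k = 1..5 given latent quality eta and 4 ordered cutpoints."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    cdf = np.concatenate(([0.0], sigmoid(cutpoints - eta), [1.0]))  # cumulative P(score <= k)
    return np.diff(cdf)                                             # per-category probabilities

# Hypothetical parameters (in the paper these would be inferred by posterior sampling).
cutpoints = np.array([-2.0, -0.5, 0.5, 2.0])        # ordered thresholds between the 5 ACR categories
beta_signal = 1.2                                   # weight on a signal-level similarity feature
lang_effect = {"en": 0.0, "fr": -0.2, "ja": 0.1}    # hypothetical per-language offsets
rater_effect = {0: 0.3, 1: -0.1}                    # hypothetical per-rater offsets

def score_probabilities(signal_feature, language, rater):
    # Latent quality combines a signal-level feature with language and rater effects.
    eta = beta_signal * signal_feature + lang_effect[language] + rater_effect[rater]
    return ordered_logistic_pmf(eta, cutpoints)

probs = score_probabilities(signal_feature=0.8, language="fr", rater=1)
for k, p in enumerate(probs, start=1):
    print(f"P(score = {k}) = {p:.3f}")
```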