Abstract

Evaluating the quality of synthesized speech is an important step in ensuring that generated audio sounds good to human listeners. There are two main approaches to this evaluation: subjective and objective. Subjective approaches use human assessors, which is the most natural choice but is time-consuming and expensive. They have therefore generally been replaced by quicker and cheaper objective approaches. However, because objective approaches analyze only audio features, the predicted quality may not correlate with what humans actually perceive. Recent studies show that brain activity contains information that can improve prediction performance. This work proposes a method that extracts features common to participants’ brain activity and uses them to predict perceived speech audio quality. The results show that the proposed approach significantly reduces the prediction error.

