Abstract

In recent decades, the quality of computer-synthesized speech has improved drastically. However, evaluating such systems remains a challenge, as the relevant methodologies have not evolved in more than a decade. Subjective evaluation provides a global overview of quality but lacks detailed feedback. Furthermore, research in objective evaluation has not yet delivered any detailed analysis methodologies. Inspired by the speech intelligibility and speech quality fields, we investigate how an Auditory Nerve (AN) model can be used to improve the objective evaluation of speech synthesis systems. To do so, we compare different configurations of Hidden Markov Model (HMM) and Deep Neural Network (DNN) synthesis using two metrics computed over spectrograms, mean-rate neurograms, and fine-timing neurograms: the Root Mean Square Error (RMSE) and the Neurogram Similarity Index Measure (NSIM). Since the AN model introduces a perceptual dimension into the analysis, we also compare the configurations using two established perceptual quality models: Perceptual Evaluation of Speech Quality (PESQ) and Virtual Speech Quality Objective Listener (ViSQOL). The results show that ViSQOL and PESQ are not suited to a fine-grained analysis of speech synthesis. They also show that comparing mean-rate neurograms using the NSIM metric is an effective alternative to comparing spectrograms using the RMSE.
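To make the two metrics concrete, the sketch below shows how RMSE and an NSIM-style similarity could be computed between two aligned time-frequency representations (spectrograms or neurograms). It is illustrative only: the paper does not specify this implementation, and the global (unwindowed) computation, the simplified intensity-times-structure form of NSIM, and the stabilising constants c1 and c3 are all assumptions.

```python
import numpy as np


def rmse(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Root Mean Square Error between two aligned 2-D representations."""
    return float(np.sqrt(np.mean((reference - degraded) ** 2)))


def nsim(reference: np.ndarray, degraded: np.ndarray,
         c1: float = 0.01, c3: float = 0.03) -> float:
    """NSIM-style similarity: an intensity term times a structure term.

    Assumes the simplified SSIM form (no contrast term); c1 and c3 are
    illustrative stabilising constants, not values from the paper.
    """
    mu_r, mu_d = reference.mean(), degraded.mean()
    sig_r, sig_d = reference.std(), degraded.std()
    # Covariance between the two representations.
    sig_rd = np.mean((reference - mu_r) * (degraded - mu_d))
    intensity = (2 * mu_r * mu_d + c1) / (mu_r ** 2 + mu_d ** 2 + c1)
    structure = (sig_rd + c3) / (sig_r * sig_d + c3)
    return float(intensity * structure)


# Toy usage: compare a reference neurogram with a noisy copy of itself.
rng = np.random.default_rng(0)
ref = rng.random((64, 100))                    # e.g. 64 bands x 100 frames
deg = ref + 0.05 * rng.normal(size=ref.shape)  # degraded version
print(f"RMSE: {rmse(ref, deg):.4f}, NSIM: {nsim(ref, deg):.4f}")
```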
