Abstract

Non-intrusive speech quality assessment is vital for objectively rating large speech datasets when no pristine reference signals are available. However, most existing research emphasizes speech clarity and neglects tempo and rhythm. Unlike ordinary Chinese speech, the prosody of classical Chinese poetry conveys profound emotion and a distinctive classical beauty. To assess the speech quality of classical Chinese poetry recitals across varied reciters and recording conditions, it is essential to design modules that evaluate both clarity and prosody. In this paper, we propose a non-intrusive assessment method that uses perceptual and acoustic features as key indicators. First, we extract the pitch frequency of the recited poetry, quantify a reference assessment function, and derive a prosodic score based on the degree of deviation. Second, we devise a residual-structured neural network that captures the perceptual features of the Mel-spectrogram and yields an objective MOS score. Finally, we build a comprehensive model that uses second-order polynomial regression to map the objective MOS and prosody scores to a poetry recital quality score. In experimental validation, both the prosody and clarity modules show outstanding prediction accuracy, and the final quality assessment score correlates highly with human perception.
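
The final mapping step can be illustrated with a minimal sketch, assuming an ordinary least-squares fit of a full second-order polynomial in two inputs. The abstract does not state which polynomial terms or fitting procedure the authors use, so the feature set (constant, linear, squared, and cross terms) and the function names `poly2_features`, `fit_poly2`, and `predict_poly2` are hypothetical and chosen only for illustration.

```python
import numpy as np

def poly2_features(mos, prosody):
    """Build second-order polynomial features from two score vectors.

    Columns: 1, mos, prosody, mos^2, mos*prosody, prosody^2.
    The choice of a full quadratic with a cross term is an assumption.
    """
    mos = np.asarray(mos, dtype=float)
    prosody = np.asarray(prosody, dtype=float)
    return np.column_stack([
        np.ones_like(mos),
        mos,
        prosody,
        mos ** 2,
        mos * prosody,
        prosody ** 2,
    ])

def fit_poly2(mos, prosody, quality):
    """Fit the six polynomial coefficients by ordinary least squares."""
    X = poly2_features(mos, prosody)
    y = np.asarray(quality, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_poly2(coef, mos, prosody):
    """Map objective MOS and prosody scores to a predicted quality score."""
    return poly2_features(mos, prosody) @ coef
```

In use, `fit_poly2` would be called once on recitals that have human quality ratings, and `predict_poly2` would then score new recitals from their objective MOS and prosody scores alone. At least six rated recitals with sufficiently varied scores are needed for the six coefficients to be determined.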

