Abstract

Text-to-speech (TTS) systems are evolving and making their way into numerous commercial products, such as smartphones and assistive technologies. Nevertheless, their user-perceived quality of experience (QoE) remains low compared to natural speech, with distortions arising across numerous perceptual dimensions, such as voice pleasantness, comprehension, and appropriateness of intonation. Unfortunately, the effects of these perceptual dimensions on overall perceived QoE are still unknown, particularly across listeners of different genders, which makes it difficult for TTS developers to further improve system quality. To overcome this limitation, this study uses exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and model invariance tests to shed light on the factors responsible for QoE perception across natural and synthesized speech, as well as across male and female listeners. Experimental EFA/CFA results on a publicly available database of commercial TTS systems showed the emergence of two key perceptual dimensions responsible for TTS QoE, namely ‘listening pleasure’ and ‘prosody’. Model invariance tests validated the reliability of the model across male and female listeners, as well as across natural and synthetic voices.
