Abstract

Speech is the most common way of communication among humans. People who cannot communicate through speech due to partial or total loss of the voice can benefit from Alternative and Augmentative Communication devices and Text to Speech technology. One problem with these technologies is that the included synthetic voices might be impersonal and badly adapted to the user in terms of age, accent or even gender. In this context, the use of synthetic voices from voice banking systems is an attractive alternative. New voices can be obtained by applying adaptation techniques to recordings from people with healthy voices (donors) or from the users themselves before they lose their own voice. The goal is to offer a wide voice catalog to potential users. However, as there is no control over the recording or the adaptation processes, some method to control the final quality of the voice is needed. We present the work developed to automatically select the best synthetic voices using a set of objective measures and a subjective Mean Opinion Score (MOS) evaluation. A prediction algorithm for the MOS has been built, and its correlation with the subjective scores is similar to that of the most correlated individual measure.

Highlights

  • Speech is the most natural method that humans use to communicate with each other

  • We extend the initial work described in [29] by evaluating four objective measures: short time objective intelligibility (STOI), enhanced short time objective intelligibility (ESTOI), non-intrusive speech quality assessment (NISQA) and speech intelligibility in bits (SIIB)

  • We briefly describe the selected objective measures: two intrusive objective measures typically used in speech enhancement, STOI [42] and ESTOI [43]; one intrusive intelligibility measure based on information theory, SIIB [44]; and one measure based on NISQA that estimates the mean opinion score (MOS) of the naturalness of synthetic speech [45]
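The idea of predicting MOS from a set of objective measures, and then comparing that predictor against each individual measure by correlation, can be illustrated with a minimal sketch. The source does not state which prediction algorithm was used, so ordinary least squares is an assumption here. The data values are made up, the two feature columns are hypothetical stand-ins for objective scores, and all function names are illustrative only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def fit_linear(features, target):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, features):
    """Apply the fitted linear model to each feature row."""
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], f))
            for f in features]

# Made-up toy data (NOT from the paper): one row per synthetic voice,
# two hypothetical objective scores per voice, plus a subjective MOS.
measures = [(0.62, 2.1), (0.71, 2.9), (0.80, 3.0),
            (0.85, 3.8), (0.90, 4.1), (0.66, 2.6)]
mos = [2.0, 2.8, 3.1, 3.9, 4.2, 2.4]

weights = fit_linear(measures, mos)
predicted = predict(weights, measures)

# Correlation of each individual measure with MOS, and of the combined predictor.
individual = [pearson([m[i] for m in measures], mos) for i in range(2)]
combined = pearson(predicted, mos)
print("individual:", individual)
print("combined:  ", combined)
```

Because the model is fitted and scored on the same voices, the combined correlation can never fall below the best individual one in this sketch. A fair comparison would score the predictor on voices held out from fitting.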


Summary

Introduction

Speech is the most natural method that humans use to communicate with each other. When a person loses the ability to speak due to an accident or illness, technology can provide solutions to mitigate the impact of the disability. Text-to-speech (TTS) systems are a fundamental component of alternative and augmentative communication (AAC) devices: they provide a synthetic voice that speaks aloud text entered through an input device, such as a keyboard or an eye-gaze-controlled device. Synthetic voice customization tries to preserve the hints of personality that are absent from a generic or commercial synthetic voice. Studies such as [1] show our tendency to form an impression of other people's personality from their voice (as happens with other features, such as the face or the color of the skin). We believe that the use of personalized speech can help reduce the social impact of using an electronic device for everyday communication.
