Abstract

Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, some domains, such as storytelling or voice output aid devices, may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database must be recorded. This solution, however, might be too costly for occasional singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus, considering three vocal ranges and two tempos. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the reduction of pitch-scale factors, time-scale factors are not reduced due to the short length of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness rates of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of Vocaloid, the singing scores of around 60 validate that the framework could reasonably address occasional singing needs.
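The abstract's observation about challenging STS transformation factors can be made concrete with a small sketch. The following Python snippet (an illustration, not the paper's implementation; the function names and example values are assumptions) computes the two factors an STS transform must apply to a spoken vowel so that it matches a target note: a pitch-scale factor (target F0 over source F0) and a time-scale factor (note duration over vowel duration). It shows why short spoken vowels force large time-scale factors for notes longer than 150 ms.

```python
# Illustrative sketch of the STS transformation factors discussed in the
# abstract. Function names and the example values below are hypothetical.

def pitch_scale_factor(target_f0_hz: float, source_f0_hz: float) -> float:
    """Ratio by which the spoken vowel's F0 must be shifted to reach the note."""
    return target_f0_hz / source_f0_hz

def time_scale_factor(note_ms: float, vowel_ms: float) -> float:
    """Ratio by which the spoken vowel must be stretched to fill the note."""
    return note_ms / vowel_ms

# Example: a spoken vowel at 120 Hz lasting 80 ms, targeting a 220 Hz note
# held for 400 ms (values chosen for illustration only).
ps = pitch_scale_factor(220.0, 120.0)  # ~1.83, close to an octave up
ts = time_scale_factor(400.0, 80.0)    # 5.0, a five-fold stretch
```

The further the score lies from the corpus vocal range, and the longer the notes relative to the spoken vowels, the larger these factors become, which is what makes the transformation "challenging".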

Highlights

  • Text-to-speech (TTS) synthesis systems have been widely used to generate speech in several general-purpose applications, such as call-centre automation, reading emails or news, or providing travel directions, among others [1]

  • It is worth mentioning that early works on speech synthesis already enabled the generation of both speech and singing, as they stood on a source-filter model inspired by the classical acoustic theory of voice production [1]

  • Speech synthesis research subsequently moved to corpus-based approaches, deploying TTS systems based on unit selection (US), hidden Markov models (HMM) or hybrid approaches


Summary

Introduction

Text-to-speech (TTS) synthesis systems have been widely used to generate speech in several general-purpose applications, such as call-centre automation, reading emails or news, or providing travel directions, among others [1]. A TTS system with singing capabilities could also be useful in assistive technologies, where the incorporation of songs has proved effective. In this sense, it is worth mentioning that early works on speech synthesis already enabled the generation of both speech and singing (e.g. see [8]), as they stood on a source-filter model inspired by the classical acoustic theory of voice production [1]. While some approaches used diphone-based TTS systems to generate singing [9, 10], most works opted to use databases recorded for singing purposes [11, 12, 13]. Speech synthesis research subsequently moved to corpus-based approaches, deploying TTS systems based on unit selection (US), hidden Markov models (HMM) or hybrid approaches.


