Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni, June Sig Sung, Aimilios Chalamandaris, Pirros Tsiakoulis, Paris Mastorocostas

doi:10.1016/j.specom.2022.11.006

Open Access

https://doi.org/10.1016/j.specom.2022.11.006

Abstract

In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process that discretizes phoneme-level F0 and duration features from a multispeaker speech dataset. The resulting sequence of prosodic labels is fed to a prosody encoder module that augments an autoregressive attention-based text-to-speech model. To improve the range and coverage of prosodic control, we use several methods, such as augmentation, F0 normalization, balanced clustering for duration, and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers in the training set while maintaining speaker identity. Instead of relying on reference utterances at inference, we introduce a prior prosody encoder that learns the style of each speaker and enables speech synthesis without reference audio. As a realistic application scenario, we also fine-tune the multispeaker model to unseen speakers with limited amounts of data and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model produces high-quality speech and that the proposed method allows efficient prosody control within each speaker’s range despite the variability that a multispeaker setting introduces.
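To make the discretization step concrete, the following is a minimal sketch of how phoneme-level F0 and duration values could be turned into discrete labels. It is not the authors' implementation: the abstract does not name the clustering algorithm, the normalization scheme, or the number of clusters. The sketch assumes per-speaker z-score normalization of F0, plain 1-D k-means for F0, and equal-frequency (quantile) binning as one possible form of "balanced clustering" for duration. All function names, the choice of 5 labels, and the synthetic data are hypothetical.

```python
import numpy as np

def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score each phoneme-level F0 value using its own speaker's statistics
    (an assumed form of F0 normalization)."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out

def kmeans_1d(values, n_clusters, n_iter=50):
    """Plain 1-D k-means. Labels are ordered so that 0 is the lowest cluster."""
    values = np.asarray(values, dtype=float)
    # Deterministic initialization at evenly spaced quantiles.
    centroids = np.quantile(values, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        dist = np.abs(values[:, None] - centroids[None, :])
        labels = np.argmin(dist, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = values[labels == k].mean()
    centroids = np.sort(centroids)
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids

def balanced_bins(values, n_bins):
    """Equal-frequency binning: each label covers roughly the same number of
    phonemes (an assumed reading of 'balanced clustering')."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")

# Synthetic stand-in for a two-speaker dataset (hypothetical values).
rng = np.random.default_rng(0)
speakers = np.repeat(["spk_a", "spk_b"], 500)
f0 = np.concatenate([rng.normal(120, 20, 500), rng.normal(220, 30, 500)])  # Hz
dur = rng.gamma(shape=2.0, scale=40.0, size=1000)                          # ms

f0_norm = normalize_f0_per_speaker(f0, speakers)
f0_labels, f0_centroids = kmeans_1d(f0_norm, n_clusters=5)
dur_labels = balanced_bins(dur, n_bins=5)
```

Because F0 is normalized per speaker before clustering, a single set of centroids can be shared across speakers, which is one way to read "speaker-independent clustering": the same label then means "low for this speaker" or "high for this speaker" rather than an absolute pitch. In a full system, the two label sequences would be the input to the prosody encoder described above.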

Journal: Speech Communication	Publication Date: Nov 24, 2022
Citations: 1

