Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Eugene Kharitonov, Damien Vincent, Sertan Girgin, Zalán Borsos, Raphaël Marinier, Neil Zeghidour, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi

doi:10.1162/tacl_a_00618

Abstract

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to “reading”) and from semantic tokens to low-level acoustic tokens (“speaking”). Decoupling these two tasks enables training of the “speaking” module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the “reading” component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in naturalness and acoustic quality.
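
The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_read`, `toy_speak`, and `synthesize` are hypothetical names, and the toy functions are trivial stand-ins for the learned sequence-to-sequence models. The sketch only shows how the "reading" stage (text to semantic tokens) composes with the "speaking" stage (semantic tokens to acoustic tokens), with a short acoustic prompt standing in for the speaker example.

```python
from typing import Callable, List

Tokens = List[int]


def toy_read(text: str) -> Tokens:
    """Hypothetical stand-in for the "reading" stage: text -> semantic tokens."""
    return [ord(ch) % 50 for ch in text]


def toy_speak(semantic: Tokens, prompt: Tokens) -> Tokens:
    """Hypothetical stand-in for the "speaking" stage.

    Maps semantic tokens to acoustic tokens, conditioned on a prompt that
    plays the role of the short speaker example (assumption: the prompt is
    already tokenized).
    """
    # A toy "voice" offset derived from the prompt; a real model would
    # condition on the prompt tokens directly.
    voice = sum(prompt) % 7 if prompt else 0
    acoustic: Tokens = []
    for s in semantic:
        # Emit two acoustic tokens per semantic token (arbitrary toy ratio).
        acoustic.extend([2 * s + voice, 2 * s + 1 + voice])
    return acoustic


def synthesize(
    text: str,
    prompt: Tokens,
    read: Callable[[str], Tokens] = toy_read,
    speak: Callable[[Tokens, Tokens], Tokens] = toy_speak,
) -> Tokens:
    """Compose the two stages: read the text, then speak it in the prompt's voice."""
    semantic = read(text)
    return speak(semantic, prompt)


if __name__ == "__main__":
    print(synthesize("hi", prompt=[1, 2, 3]))
```

Because the two stages only communicate through semantic tokens, either stand-in can be swapped out independently, which mirrors the decoupling the abstract relies on to train the "speaking" stage on audio-only data.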

Journal: Transactions of the Association for Computational Linguistics
Publication Date: Dec 21, 2023
Citations: 24
License type: CC BY 4.0
