Abstract

This paper presents UnitNet, a sequence-to-sequence (Seq2Seq) acoustic model for concatenative speech synthesis. Compared with the Tacotron2 model for Seq2Seq speech synthesis, UnitNet utilizes the phone boundaries of the training data, and its decoder contains autoregressive structures at both the phone and frame levels. This hierarchical architecture can not only extract embedding vectors for representing phone-sized units in the corpus but also measure the dependency among consecutive units, which makes the UnitNet model capable of guiding the selection of phone-sized units for concatenative speech synthesis. A byproduct of this model is that it can also be applied to statistical parametric speech synthesis (SPSS) and improve the robustness of Seq2Seq acoustic feature prediction, since it adopts interpretable transition probability prediction rather than an attention mechanism for frame-level alignment. Experimental results show that our UnitNet-based concatenative speech synthesis method not only outperforms the unit selection methods using hidden Markov models and Tacotron-based unit embeddings, but also achieves better naturalness and faster inference speed than the SPSS method using FastSpeech and Parallel WaveGAN. In addition, the UnitNet-based SPSS method makes fewer synthesis errors than Tacotron2 and FastSpeech without naturalness degradation.
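To illustrate the idea of transition-probability-driven alignment mentioned above, the following is a minimal sketch, not the authors' UnitNet implementation: a toy frame-level decoder step that predicts the next acoustic frame together with a probability of advancing to the next phone, so the alignment is monotonic and interpretable rather than computed by attention. All layer sizes, names, and the stopping rule are hypothetical.

```python
# Illustrative sketch only (hypothetical dimensions and names), showing how a
# decoder can advance alignment via a predicted transition probability instead
# of attention weights over encoder outputs.
import torch
import torch.nn as nn

class TransitionDecoderStep(nn.Module):
    """Toy decoder cell: given the current phone embedding and the previous
    acoustic frame, predict the next frame and the probability of moving on
    to the next phone."""
    def __init__(self, phone_dim=256, frame_dim=80, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRUCell(phone_dim + frame_dim, hidden_dim)
        self.frame_out = nn.Linear(hidden_dim, frame_dim)    # acoustic features
        self.transition = nn.Linear(hidden_dim, 1)           # move-to-next-phone logit

    def forward(self, phone_emb, prev_frame, hidden):
        hidden = self.rnn(torch.cat([phone_emb, prev_frame], dim=-1), hidden)
        frame = self.frame_out(hidden)
        p_transition = torch.sigmoid(self.transition(hidden))  # interpretable alignment signal
        return frame, p_transition, hidden

# Usage: step through frames, advancing the phone index when the predicted
# transition probability exceeds a threshold (dummy data, toy stopping rule).
phones = torch.randn(1, 10, 256)           # 10 phone-level embeddings
step = TransitionDecoderStep()
hidden = torch.zeros(1, 512)
frame = torch.zeros(1, 80)
idx = 0
for _ in range(50):                        # at most 50 frames in this toy example
    frame, p_move, hidden = step(phones[:, idx], frame, hidden)
    if p_move.item() > 0.5 and idx < phones.size(1) - 1:
        idx += 1                           # alignment advances monotonically
```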
