Revisiting Over-Smoothness in Text to Speech

Yi Ren, Tie-Yan Liu, Xu Tan, Tao Qin, Zhou Zhao

doi:10.48448/zf39-aw02

Abstract

Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in the time and frequency domains when generating speech mel-spectrograms, which leads to blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of the data distribution and the capability of the modeling method. Both simplifying the data distribution and improving the modeling method can therefore alleviate the problem. Accordingly, we first study methods that reduce the complexity of data distributions. We then conduct a comprehensive study of NAR-TTS models that use advanced modeling methods. Based on these studies, we find that 1) methods that provide additional condition inputs reduce the complexity of the data distributions to be modeled, thus alleviating the over-smoothing problem and achieving better voice quality; 2) among advanced modeling methods, Laplacian mixture loss performs well at modeling multimodal distributions and is simple, while GAN and Glow achieve the best voice quality at the cost of increased training or model complexity; 3) the two categories of methods can be combined to further alleviate over-smoothness and improve voice quality; and 4) our experiments on the multi-speaker dataset lead to similar conclusions and show that providing more variance information can reduce the difficulty of modeling the target data distribution and relax the requirements on model capacity.
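
The core argument above, that a unimodal loss over-smooths a multimodal target while extra condition inputs or a mixture loss do not, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's implementation: it models a single scalar mel bin with a hypothetical bimodal distribution, and the helper names (`logsumexp`, `laplace_mixture_nll`), the component count, locations, and scales are illustrative choices. The abstract does not give the exact form of the Laplacian mixture loss; here it is assumed to be the negative log-likelihood (NLL) of a K-component Laplacian mixture, where K = 1 with a fixed scale reduces to an L1 loss up to scale and a constant.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def laplace_mixture_nll(y: float, weights: Sequence[float],
                        locs: Sequence[float], scales: Sequence[float]) -> float:
    """NLL of scalar y under a K-component Laplacian mixture (hypothetical helper).

    p(y) = sum_k w_k * exp(-|y - mu_k| / b_k) / (2 * b_k)
    """
    log_terms = [
        math.log(w) - math.log(2.0 * b) - abs(y - mu) / b
        for w, mu, b in zip(weights, locs, scales)
    ]
    return -logsumexp(log_terms)


# Toy bimodal target (assumption): one mel bin that is either -1 or +1 with
# equal probability, depending on a hidden factor the model does not see.
targets = [-1.0, 1.0] * 50

# An MSE-trained regressor predicts the mean, which lies between the two
# modes where no real data exists: the over-smoothed output.
mse_prediction = sum(targets) / len(targets)

# Single Laplacian (K = 1) at that prediction, with its maximum-likelihood
# scale b = mean |y - mu|; this is the probabilistic view of an L1 loss.
b_single = sum(abs(y - mse_prediction) for y in targets) / len(targets)
nll_single = sum(laplace_mixture_nll(y, [1.0], [mse_prediction], [b_single])
                 for y in targets) / len(targets)

# Two-component mixture with one component on each mode (scales assumed).
nll_mixture = sum(laplace_mixture_nll(y, [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
                  for y in targets) / len(targets)

# Simplifying the data distribution instead: if the hidden factor is given as
# an extra condition input (here hypothetically the sign of the target), each
# conditional distribution is unimodal and plain MSE recovers each mode.
conditions = [0 if y < 0 else 1 for y in targets]
conditional_mse_prediction = {
    c: sum(y for y, ci in zip(targets, conditions) if ci == c) / conditions.count(c)
    for c in (0, 1)
}

print(f"MSE-optimal prediction:      {mse_prediction:.3f}")
print(f"single Laplacian NLL:        {nll_single:.3f}")
print(f"two-component mixture NLL:   {nll_mixture:.3f}")
print(f"conditional MSE predictions: {conditional_mse_prediction}")
```

Running it prints an unconditional MSE prediction of 0.000, halfway between the two modes, and a lower average NLL for the two-component mixture than for the single Laplacian. Supplying the hidden factor as a condition gives per-condition predictions of -1.0 and 1.0, matching the two modes, which mirrors findings 1) and 2) in miniature.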

Similar Papers

Revisiting Over-Smoothness in Text to Speech
Yi Ren ... Xu Tan
01 Jan 2021

Advanced Time Domain Modeling for Electrical Engineering
26 Aug 2022

Occupant behavior modeling methods for resilient building design, operation and policy at urban scale: A review
Bing Dong ... Salvatore Carlucci
Applied Energy | VOL. 293
21 Apr 2021

Lessons Learned from the U.S. Nuclear Regulatory Commission’s Digital System Risk Research
Steven A Arndt ... Alan Kuritzky
Nuclear Technology | VOL. 173
01 Jan 2010