Abstract

This paper presents PeriodNet, a non-autoregressive (non-AR) waveform generative model with a new model structure for modeling periodic and aperiodic components in speech waveforms. Non-AR raw waveform generative models have enabled the fast generation of high-quality waveforms. However, the variations of waveforms that these models can reconstruct are limited by the training data. In addition, typical non-AR models reconstruct a speech waveform from a single Gaussian input, despite the mixture of periodic and aperiodic signals in speech. These limitations may significantly affect the waveform generation process in applications such as singing voice synthesis, which require reproducing accurate pitch and natural sounds with less periodicity, including husky and breath sounds. To tackle these problems, PeriodNet uses a parallel or series model structure to model a speech waveform. Two sub-generators connected in parallel or in series take an explicit periodic and aperiodic signal (a sine wave and Gaussian noise) as input. Since PeriodNet models periodic and aperiodic components by focusing on whether these input signals are autocorrelated or not, it does not require external periodic/aperiodic decomposition during training. Experimental results show that our proposed structure improves the naturalness of generated waveforms. We also show that speech waveforms with a pitch outside the training data range can be generated more naturally.

Highlights


  • We show that PeriodNet can model a speech waveform while appropriately separating periodic and aperiodic components during the training process by comparing it with systems that use periodic and aperiodic waveforms pre-decomposed by using explicit decomposition techniques (Section V)

  • We propose a non-AR waveform generative model with a structure separating periodic and aperiodic components in speech waveforms, called “PeriodNet.” PeriodNet consists of two sub-generators connected in parallel or in series that take a sine-based input signal and a Gaussian noise signal, respectively, and it represents a speech waveform as the sum of the outputs of both generators
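The parallel structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the real sub-generators are conditioned neural networks, so fixed FIR filters stand in for them here, and all function names and filter weights are hypothetical. The sketch only shows the data flow: a sine excitation derived from the F0 contour feeds the periodic branch, Gaussian noise feeds the aperiodic branch, and the waveform is the sum of the two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sine_excitation(f0_hz, sr=16000):
    # Periodic (autocorrelated) input: a sine wave whose instantaneous
    # frequency follows the sample-level F0 contour.
    phase = 2.0 * np.pi * np.cumsum(f0_hz) / sr
    return np.sin(phase)

def sub_generator(x, taps):
    # Stand-in for a neural sub-generator: a fixed FIR filter.
    # (Hypothetical weights; PeriodNet uses conditioned neural networks.)
    return np.convolve(x, taps, mode="same")

def periodnet_parallel(f0_hz, sr=16000):
    periodic_in = sine_excitation(f0_hz, sr)         # sine-based input
    aperiodic_in = rng.standard_normal(len(f0_hz))   # Gaussian noise input
    periodic_out = sub_generator(periodic_in, np.array([0.5, 0.3, 0.2]))
    aperiodic_out = sub_generator(aperiodic_in, np.array([0.2, 0.1]))
    return periodic_out + aperiodic_out              # sum of both branches

f0 = np.full(16000, 200.0)  # 1 s of a flat 200 Hz pitch contour
wav = periodnet_parallel(f0)
```

In the series variant, the output of one sub-generator would instead be passed as an additional input to the other rather than being summed independently.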


Summary

INTRODUCTION

Hono et al.: PeriodNet: A non-AR raw waveform generative model with a structure separating periodic and aperiodic components

Speech synthesis technology has been rapidly advancing with the introduction of neural networks (NNs). Since NN-based generative models can generate raw waveforms by conditioning on acoustic features [12], they have succeeded in replacing conventional vocoders, giving speech applications the benefit of generating high-quality speech waveforms [13]–[15]. However, these models have huge network architectures with AR mechanisms, which suffer from slow inference speed.

NEURAL WAVEFORM GENERATIVE MODELS
DETAILS OF TRAINING FRAMEWORK
EXPERIMENTAL CONDITIONS
Findings
CONCLUSION

