Abstract

In recent years, methods based on discrete diffusion models have achieved state-of-the-art performance in voice generation [1,2]. In theory, the voice data can be transformed into an exact Gaussian prior distribution only when the diffusion time tends to infinity. In real applications, however, these diffusion-based methods run for a limited time and reach the Gaussian prior only approximately, which results in sub-optimal sound quality. In this paper, we present SchröWave, which realizes a continuous transformation from exact Dirac deltas to the target voice data distribution in finite time, conditioned on intermediate voice representations of different sizes. In addition, to overcome the difficulty of calculating the score on the low-dimensional manifold of voice data during generation, we propose a two-stage diffusion and generation method, with each stage implemented by solving a conditional Schrödinger bridge problem. Our experiments on the public LJSpeech data set show significant improvements in both objective and subjective evaluation and achieve a new state-of-the-art MOS of 4.53. Audio samples are available at https://schrowave.github.io.
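The claim that the Gaussian prior is reached only as the diffusion time tends to infinity can be illustrated with a minimal sketch. It assumes a scalar variance-preserving diffusion dx = -0.5·β·x dt + sqrt(β) dW with constant β, which is a standard textbook setup and not necessarily the one used in this paper. The function names `vp_marginal` and `kl_to_standard_normal`, the value β = 1, and the starting point x0 = 2 are all illustrative assumptions.

```python
import math


def vp_marginal(x0, t, beta=1.0):
    """Mean and variance of x_t given x_0 under the assumed
    variance-preserving diffusion dx = -0.5*beta*x dt + sqrt(beta) dW."""
    mean = math.exp(-0.5 * beta * t) * x0
    var = 1.0 - math.exp(-beta * t)
    return mean, var


def kl_to_standard_normal(mean, var):
    """KL divergence KL(N(mean, var) || N(0, 1)) for scalar Gaussians."""
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))


# The gap to the N(0, 1) prior shrinks with t but stays positive
# for every finite diffusion time.
for t in (1.0, 5.0, 20.0):
    mean, var = vp_marginal(x0=2.0, t=t)
    print(f"t={t:5.1f}  mean={mean:.3e}  var={var:.6f}  "
          f"KL={kl_to_standard_normal(mean, var):.3e}")
```

Under these assumptions the residual mean decays like exp(-βt/2) and never reaches zero, so a sampler that starts from an exact N(0, 1) prior is mismatched with the true terminal distribution at any finite time. A Schrödinger bridge formulation instead fixes both endpoint distributions over a finite horizon.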
