Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System

Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda, Kei Hashimoto, Keiichiro Oura

doi:10.1109/taslp.2021.3104165


Open Access

https://doi.org/10.1109/taslp.2021.3104165


Abstract

This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system. In recent years, DNNs have been utilized in statistical parametric SVS systems, and DNN-based SVS systems have demonstrated better performance than conventional hidden Markov model-based ones. SVS systems are required to synthesize a singing voice with pitch and timing that strictly follow a given musical score. Additionally, singing expressions that are not described on the musical score, such as vibrato and timing fluctuations, should be reproduced. The proposed system is composed of four modules: a time-lag model, a duration model, an acoustic model, and a vocoder, and singing voices can be synthesized taking these characteristics of singing voices into account. To better model a singing voice, the proposed system incorporates improved approaches to modeling pitch and vibrato and better training criteria into the acoustic model. In addition, we incorporated PeriodNet, a non-autoregressive neural vocoder with robustness for the pitch, into our systems to generate a high-fidelity singing voice waveform. Moreover, we propose automatic pitch correction techniques for DNN-based SVS to synthesize singing voices with correct pitch even if the training data has out-of-tune phrases. Experimental results show our system can synthesize a singing voice with better timing, more natural vibrato, and correct pitch, and it can achieve better mean opinion scores in subjective evaluation tests.
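As a rough illustration of how phone durations can be made to respect the note lengths in a score while still allowing timing fluctuations, the sketch below rescales predicted durations proportionally. It is a simplification for illustration only, not the method used in Sinsy. The function name `fit_phone_durations`, its parameters, and the sign convention for the time-lags (actual onset minus score onset, in frames) are all assumptions.

```python
from itertools import accumulate


def fit_phone_durations(predicted, note_frames, lag_start=0, lag_end=0):
    """Rescale predicted phone durations to fill one note (hypothetical helper).

    predicted   : predicted duration of each phone in the note, in frames
    note_frames : note length taken from the score, in frames
    lag_start   : assumed time-lag of this note's onset (actual - score)
    lag_end     : assumed time-lag of the next note's onset (actual - score)
    """
    # Sung length of the note once both onsets are shifted by their time-lags.
    target = note_frames - lag_start + lag_end
    cumulative = list(accumulate(predicted))
    total = cumulative[-1]
    if target <= 0 or total <= 0:
        raise ValueError("note length and predicted durations must be positive")

    # Round the cumulative phone boundaries rather than each duration, so the
    # integer durations always sum exactly to the target length.
    # A phone may end up with zero frames; this sketch does not prevent that.
    bounds = [round(c * target / total) for c in cumulative]
    return [end - start for start, end in zip([0] + bounds[:-1], bounds)]


durations = fit_phone_durations([3.0, 10.0, 7.0], note_frames=40, lag_start=-2)
print(durations, sum(durations))
```

Rounding the boundaries instead of the individual durations is a design choice that keeps the total fixed to the note length without a separate correction step.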

Highlights

Singing voice synthesis (SVS) is a technique for generating singing voices from musical scores.
This paper presents our deep neural network (DNN)-based SVS system, “Sinsy.” The proposed system is an extension of our previous work [9].
Table notes: (a) pitch normalization is described in Section IV-A; (b) the skip connection is described in Section IV-A; (c) “Sine-based” denotes the sine-based vibrato modeling described in Section IV-B1, and “Diff-based” denotes the difference-based vibrato modeling described in Section IV-B2; (d) the training criteria L, L(s), and L(d) are given by (8), (4), and (6), respectively.

Summary

INTRODUCTION

Singing voice synthesis (SVS) is a technique for generating singing voices from musical scores. In SVS systems, singing voices must be synthesized so that they accurately follow the input musical score. Methods such as pitch normalization [8] and data augmentation [12], [17] have been proposed for DNN-based SVS systems to generate a fundamental frequency (F0) that follows the note pitch in the input musical score. A framework with a time-lag model and a duration model has been proposed to determine the phone durations under note length constraints while accounting for timing fluctuations [9]. These techniques are essential for synthesizing a natural, human-like singing voice. All the components for synthesizing a singing voice from the analyzed score features are based on neural networks and incorporate novel techniques to better model a singing voice.
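One way to picture pitch normalization is as modeling log F0 relative to the note pitch rather than in absolute terms, so that the generated F0 stays anchored to the score. The sketch below assumes the normalized target is the difference between the log F0 and the log frequency of the note, with notes given as MIDI note numbers in equal temperament (A4 = 440 Hz). These details and the function names are assumptions for illustration and are not taken from the paper.

```python
import math


def midi_to_hz(note_number):
    """Equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note_number - 69) / 12.0)


def normalize_log_f0(f0_hz, note_numbers):
    """Training side (assumed form): per-frame residual log F0 - log note pitch.

    Both inputs are per-frame sequences covering voiced frames only.
    """
    return [math.log(f0) - math.log(midi_to_hz(n))
            for f0, n in zip(f0_hz, note_numbers)]


def denormalize_log_f0(residuals, note_numbers):
    """Synthesis side: add the residual back onto the score's note pitch."""
    return [midi_to_hz(n) * math.exp(r)
            for r, n in zip(residuals, note_numbers)]


# A singer who is slightly sharp on A4 and exactly in tune on A5.
f0 = [445.0, 880.0]
notes = [69, 81]
residuals = normalize_log_f0(f0, notes)
print(residuals)
print(denormalize_log_f0(residuals, notes))
```

Under this assumed formulation, a residual of zero corresponds to singing exactly on the note, so a model only has to learn deviations such as overshoot or vibrato instead of the absolute pitch.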

RELATED WORK

Overview

Acoustic Model

Time-Lag Model and Duration Model

Neural Vocoder

Pitch Normalization

Vibrato Model

AUTOMATIC PITCH CORRECTION

Prior Distribution of Pitch

Pseudo-Note Pitch

Experimental Conditions

Objective Evaluation of Time-Lag Modeling and Duration Modeling

Method

Comparison of Acoustic Feature Modeling

System 1, System 2, System 3, System 4, System 5, System 6, System 7

Effectiveness of Automatic Pitch Correction Techniques

Findings

CONCLUSION


Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date: Jan 1, 2021
Citations: 17
License type: CC BY 4.0


Similar Papers

An On-the-Fly Mandarin Singing Voice Synthesis System
Cheng-Yuan Lin ... Shaw-Hwa Hwang
01 Jan 2002

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Jinglin Liu ... Yi Ren
Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence | VOL. 36
28 Jun 2022

Phoneix: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation With Phoneme Distribution Predictor
Yuning Wu ... Tao Qian
04 Jun 2023

Recent Development of the DNN-based Singing Voice Synthesis System — Sinsy
Yukiya Hono ... Keiichi Tokuda
01 Nov 2018
