Analysis of acoustic features affecting singing‐voice perception and its application to singing‐voice synthesis from speaking‐voice using STRAIGHT

Takeshi Saitou, Masashi Unoki, Masato Akagi

doi:10.1121/1.4787139

Abstract

A singing‐voice synthesis method that transforms a speaking voice into a singing voice using STRAIGHT is proposed. The method comprises three parts: an F0 control model, a spectral sequence control model, and a duration control model. These models were constructed by analyzing, through psychoacoustic experiments, the characteristics of each acoustic feature that affects singing‐voice perception. The F0 control model generates a singing‐voice F0 contour by considering four F0 fluctuations that affect the naturalness of a singing voice: overshoot, vibrato, preparation, and fine (unsteady) fluctuation. The spectral sequence control model modifies the speaking‐voice spectral shape into a singing‐voice spectral shape by controlling the singer’s formant, a prominent peak of the spectral envelope at around 3 kHz, and the amplitude modulation of formants synchronized with vibrato. The duration control model stretches each speaking‐voice phoneme duration to a singing‐voice phoneme duration based on note duration. Results show that the proposed method can synthesize a natural singing voice whose sound quality resembles that of an actual singing voice.
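As a rough illustration of one of the four F0 fluctuations named in the abstract, the following minimal sketch adds a sinusoidal vibrato to a stepwise melody contour. It is not the authors' F0 control model, which the abstract does not specify; the function names, the frame rate, the 6 Hz vibrato rate, and the 50-cent vibrato extent are all assumptions chosen for illustration, and overshoot, preparation, and fine fluctuation are not modeled.

```python
import math


def note_to_hz(midi_note):
    """Convert a MIDI note number to frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def f0_contour_with_vibrato(notes, frame_rate=200.0,
                            vib_rate_hz=6.0, vib_extent_cents=50.0):
    """Build a frame-wise F0 contour (Hz) from (midi_note, seconds) pairs.

    Hypothetical helper: each note is held at its target pitch and
    modulated by a sinusoid expressed in cents. The vibrato phase
    restarts at every note, which is a simplification.
    """
    contour = []
    for midi_note, duration in notes:
        base_hz = note_to_hz(midi_note)
        n_frames = int(round(duration * frame_rate))
        for i in range(n_frames):
            t = i / frame_rate
            # Deviation from the target pitch, in cents (1200 cents = 1 octave).
            cents = vib_extent_cents * math.sin(2.0 * math.pi * vib_rate_hz * t)
            contour.append(base_hz * 2.0 ** (cents / 1200.0))
    return contour


# Two half-second notes: A4 followed by B4.
melody = [(69, 0.5), (71, 0.5)]
f0 = f0_contour_with_vibrato(melody)
```

In a full system the resulting contour would replace the speaking-voice F0 before resynthesis; here it is only a list of per-frame frequencies that stays within the assumed vibrato extent around each note's target pitch.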
