Abstract
This paper proposes an improved emotional voice conversion (EVC) method with controllable emotion strength and duration. EVC methods without duration mapping generate emotional speech whose duration is identical to that of the neutral input speech. In reality, even the same sentence is spoken with different speeds and rhythms depending on the emotion. To solve this, the proposed method adopts a sequence-to-sequence network with an attention module that enables the network to learn which part of the neutral input sequence should be attended to when generating each part of the emotional output sequence. In addition, to capture the multi-attribute aspects of emotional variation, an emotion encoder is designed to transform acoustic features into emotion embedding vectors. By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength. By introducing a speaker encoder, the proposed method preserves speaker identity even after emotion conversion. Objective and subjective evaluation results confirm that the proposed method outperforms previous methods. In particular, it achieves successful results in emotion strength control.
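The aggregation and weighting of emotion embeddings described above can be illustrated with a minimal sketch. It assumes that aggregation is an element-wise mean and that emotion strength is applied as a scalar multiplier; the abstract does not specify either choice, and the function names `mean_vector`, `representative_vectors`, and `scale_strength` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Aggregate embeddings by element-wise mean (assumed aggregation rule)."""
    if not vectors:
        raise ValueError("need at least one embedding vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def representative_vectors(by_emotion: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Compute one representative vector per emotion label."""
    return {emotion: mean_vector(vs) for emotion, vs in by_emotion.items()}


def scale_strength(representative: Vector, strength: float) -> Vector:
    """Weight the representative vector by a scalar strength (assumed weighting rule)."""
    return [strength * x for x in representative]


# Toy 2-dimensional embeddings; real emotion embeddings would come from the emotion encoder.
embeddings = {
    "happy": [[1.0, 0.0], [3.0, 2.0]],
    "sad": [[-2.0, 0.0], [-4.0, -2.0]],
}
reps = representative_vectors(embeddings)
half_happy = scale_strength(reps["happy"], 0.5)
```

Under these assumptions, a strength of 1.0 would reproduce the representative vector, while smaller values would move the conditioning vector toward zero, which is one plausible way to express a weaker emotion.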
Highlights
Voice conversion (VC) refers to a technique of converting voice characteristics while preserving the linguistic information of an input utterance
We propose a novel emotional voice conversion (EVC) method that can synthesize emotional output speech with adjusted duration using a sequence-to-sequence network
Inspired by the controllable emotion strength in TTS, we propose an EVC method that controls the degree of the target emotion
Summary
Voice conversion (VC) refers to a technique of converting voice characteristics while preserving the linguistic information of an input utterance. To preserve the linguistic information in the content embedding matrices, both the source and the target decoder outputs have to be reflected in the overall loss. The attention module is quite costly to learn; we therefore treat the acoustic features of the source and the target as a pair of parallel sequences with identical linguistic content, uttered in neutral and emotional speaking styles, respectively.
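To illustrate why attention over a parallel neutral/emotional pair lets the output duration differ from the input duration, here is a minimal sketch. It assumes plain dot-product attention, which may not be the attention variant used in the paper; the function names and the toy frame values are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: List[Vector], keys: List[Vector],
           values: List[Vector]) -> Tuple[List[Vector], List[List[float]]]:
    """For each output step (query), weight all input frames (keys/values)."""
    contexts, alignments = [], []
    dim = len(values[0])
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        contexts.append(context)
        alignments.append(weights)
    return contexts, alignments


# Three neutral input frames and five emotional output steps:
# the number of output steps is set by the queries, not by the input length.
neutral_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
contexts, alignments = attend(decoder_queries, neutral_frames, neutral_frames)
```

Each row of `alignments` is a distribution over the input frames, so one input frame can be attended to by several output steps (a slower rendition) or skipped almost entirely (a faster one).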