Abstract

Emotional voice conversion (EVC) aims to convert speech from one emotional state to another while keeping the linguistic content, speaker identity, and other emotion-independent information unchanged. Because previous studies were limited to a fixed set of emotions, converting to emotions never seen during the training stage remains challenging. In this paper, we propose a one-shot emotional voice conversion model based on feature separation. The proposed method controls emotional characteristics with Global Emotion Embeddings (GEEs) and introduces activation guidance (AG) and mutual information (MI) minimization to reduce the correlation between the emotion embedding and the emotion-independent representation. At run-time, it can produce the desired emotional utterance from a single pair of utterances without any emotion labels, whether or not the target emotion appears in the training set. Subjective and objective evaluations validate the effectiveness of the proposed model in both the degree of feature separation and emotion expressiveness, and show that it can even achieve unseen emotion conversion.
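To illustrate the MI-minimization idea mentioned above, here is a minimal sketch that penalizes mutual information between the emotion embedding and the emotion-independent representation, assuming a CLUB-style variational upper bound (Cheng et al., 2020). The abstract does not specify which estimator the paper uses, so the estimator choice, network shapes, and all names here (CLUBEstimator, mi_upper_bound, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: CLUB-style upper bound on I(emotion; content),
# used as a penalty to separate the two representations.
import torch
import torch.nn as nn


class CLUBEstimator(nn.Module):
    """Bounds I(emotion; content) from above via a variational q(e | c)."""

    def __init__(self, content_dim: int, emotion_dim: int, hidden_dim: int = 256):
        super().__init__()
        # q(e | c) is a diagonal Gaussian whose mean and log-variance
        # are predicted from the content representation.
        self.mu_net = nn.Sequential(
            nn.Linear(content_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emotion_dim))
        self.logvar_net = nn.Sequential(
            nn.Linear(content_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emotion_dim), nn.Tanh())

    def log_likelihood(self, content: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # log q(emotion | content), constants dropped (they cancel in the bound)
        mu, logvar = self.mu_net(content), self.logvar_net(content)
        return (-(emotion - mu) ** 2 / logvar.exp() - logvar).sum(dim=1)

    def mi_upper_bound(self, content: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # CLUB bound: likelihood of paired samples minus likelihood of
        # shuffled (marginal) samples; minimizing this reduces I(e; c).
        pos = self.log_likelihood(content, emotion)
        neg = self.log_likelihood(content, emotion[torch.randperm(emotion.size(0))])
        return (pos - neg).mean()
```

In a typical training loop of this kind, the variational network is first updated to maximize log_likelihood on paired samples, and the encoders are then updated with mi_upper_bound added to the reconstruction loss, discouraging emotion information from leaking into the emotion-independent representation.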
