Data augmentation based non-parallel voice conversion with frame-level speaker disentangler

Bo Chen, Zhihang Xu, Kai Yu

doi:10.1016/j.specom.2021.10.001

Abstract

Non-parallel voice conversion is a popular and challenging research area. The main task is to build acoustic mappings from the source speaker to the target speaker at different units (e.g., frame, phoneme, cluster, sentence). With recent high-quality speech synthesis techniques, it is possible to produce parallel speech directly from non-parallel data. This paper proposes ParaGen, a data augmentation based technique for non-parallel voice conversion. The system consists of a text-to-speech model with a speaker disentangler and a simple frame-to-frame spectrogram conversion model. The text-to-speech model takes text and reference audio as input and produces speech in the target speaker's identity that retains the time-aligned local speaking style of the reference audio. The spectrogram conversion model then converts the source spectrogram to the target speaker frame by frame. Within the text-to-speech model, an acoustic encoder extracts the local speaking style while a conditional convolutional disentangler removes the speaker identity. The local style encodings are time-aligned with the text encodings by an attention mechanism, and the attention contexts are decoded by a conditional recurrent decoder. Experiments show that after augmentation the speaker identity of the source speech is converted to the target speaker while the local speaking style (e.g., prosody) is preserved. The method is compared with an augmentation model built on a typical statistical parametric speech synthesis (SPSS) system with pre-aligned phoneme durations. The results show that the converted speech is more natural than that of the SPSS system, while the speaker similarity of the two systems is close.
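
The two-stage idea in the abstract (synthesize frame-aligned target speech from non-parallel data, then train a frame-to-frame converter on the resulting pairs) can be illustrated with a toy sketch. This is not the authors' implementation: `stub_tts`, `build_parallel_pairs`, `fit_frame_converter`, and `convert` are hypothetical names, the TTS model is replaced by a fixed per-bin scale and offset, and the conversion model is a least-squares affine map rather than a neural network. The only property it is meant to show is that time-aligned synthetic targets make a per-frame mapping trainable.

```python
import numpy as np

def stub_tts(text, reference_spec, speaker_shift):
    """Hypothetical stand-in for the speaker-disentangled TTS model.

    Returns a spectrogram with the same number of frames as the reference,
    so output frame t lines up with reference frame t. A real model would
    be a trained network; here a fixed per-bin scale and offset fakes the
    change of speaker identity. The text argument is unused in this stub.
    """
    scale, offset = speaker_shift
    return reference_spec * scale + offset

def build_parallel_pairs(corpus, speaker_shift):
    """Turn non-parallel (text, source_spec) items into frame-aligned pairs."""
    sources, targets = [], []
    for text, source_spec in corpus:
        target_spec = stub_tts(text, source_spec, speaker_shift)
        assert target_spec.shape == source_spec.shape  # time-aligned output
        sources.append(source_spec)
        targets.append(target_spec)
    # Stack all utterances into one (total_frames, n_bins) matrix each.
    return np.concatenate(sources), np.concatenate(targets)

def fit_frame_converter(src_frames, tgt_frames):
    """Fit an affine map, applied independently to each frame, by least squares."""
    ones = np.ones((src_frames.shape[0], 1))
    design = np.hstack([src_frames, ones])  # append a bias column
    weights, *_ = np.linalg.lstsq(design, tgt_frames, rcond=None)
    return weights

def convert(spec, weights):
    """Convert a source spectrogram frame by frame with the fitted map."""
    ones = np.ones((spec.shape[0], 1))
    return np.hstack([spec, ones]) @ weights

# Toy data: three "utterances" of random frames with 4 spectral bins each.
rng = np.random.default_rng(0)
n_bins = 4
corpus = [("utt %d" % i, rng.normal(size=(20 + i, n_bins))) for i in range(3)]
shift = (np.array([1.5, 0.8, 1.1, 0.9]), np.array([0.2, -0.1, 0.0, 0.3]))

src, tgt = build_parallel_pairs(corpus, shift)
weights = fit_frame_converter(src, tgt)

# Convert an unseen source utterance without needing its text.
new_spec = rng.normal(size=(7, n_bins))
converted = convert(new_spec, weights)
```

In this sketch the converter needs only the source spectrogram at conversion time; the text is used solely while building the synthetic training pairs, which mirrors the division of labor described in the abstract.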

Journal: Speech Communication
Publication Date: Nov 10, 2021
Citations: 6

