Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder

Hongqiang Du, Lei Xie

doi:10.21437/interspeech.2021-2132

Abstract

One-shot voice conversion has received significant attention because it requires only one utterance each from the source speaker and the target speaker. Moreover, neither speaker needs to be seen during training. However, existing one-shot voice conversion approaches are unstable for unseen speakers, because a speaker embedding extracted from a single utterance of an unseen speaker is unreliable. In this paper, we propose a deep discriminative speaker encoder that extracts the speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network and a squeeze-and-excitation network to extract discriminative frame-level speaker information by modeling frame-wise and channel-wise interdependence in the features. An attention mechanism is then introduced to further emphasize speaker-related information by assigning different weights to the frame-level speaker information. Finally, a statistic pooling layer aggregates the weighted frame-level speaker information into an utterance-level speaker embedding. The experimental results demonstrate that the proposed speaker encoder improves the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
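
The last two stages described in the abstract (attention weights over frames, then statistic pooling into one utterance-level vector) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attentive_stats_pooling`, the single scoring vector `attn_vector`, and the dot-product scoring are assumptions made for illustration, and the residual and squeeze-and-excitation layers that would produce the frame-level features are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, attn_vector):
    """Aggregate frame-level features into one utterance-level embedding.

    frames:      list of T frame-level feature vectors, each of length D
    attn_vector: hypothetical learned vector of length D used to score
                 each frame (an assumption, not taken from the paper)
    Returns (embedding, weights), where embedding has length 2 * D:
    the attention-weighted mean followed by the weighted standard deviation.
    """
    # One scalar score per frame, normalised across frames.
    scores = [sum(x * a for x, a in zip(frame, attn_vector)) for frame in frames]
    weights = softmax(scores)

    dim = len(frames[0])
    # Weighted first-order statistic per feature dimension.
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Weighted second-order statistic per feature dimension.
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean + std, weights


# Toy input: three frames with two feature channels each.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
# A zero scoring vector gives uniform weights, i.e. plain statistic pooling.
embedding, weights = attentive_stats_pooling(frames, [0.0, 0.0])
```

With a non-zero `attn_vector`, frames that score higher contribute more to both the mean and the standard deviation, which is the sense in which attention "emphasizes" some frames before pooling.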

Similar Papers

MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning
Yi Ting Chen ... Wanting Li
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 23
10 May 2024

Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation
Yi Zhou ... Haizhou Li
IEEE/ACM Transactions on Audio, Speech, and Language Processing | VOL. 29
01 Jan 2020

Zero-Shot Voice Conversion with Adjusted Speaker Embeddings and Simple Acoustic Features
Zhiyuan Tan ... Yuqing He
06 Jun 2021

A frame mapping based HMM approach to cross-lingual voice transformation
Yao Qian ... Ji Xu
01 May 2011