Abstract

In recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, the deep learning-based methods are known to suffer severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). In this paper, we propose a novel fully supervised training method for extracting a speaker embedding vector disentangled from the variability caused by nuisance attributes. The proposed framework was compared with conventional deep learning-based embedding methods using the RSR2015 and VoxCeleb1 datasets. Experimental results show that the proposed approach can extract speaker embeddings robust to channel and emotional variability.

Highlights

  • Speaker verification is the task of verifying the claimed speaker identity based on the given speech samples and has become a key technology for personal authentication in many commercial applications, forensics and law enforcement [1]

  • The results demonstrate that although the proposed joint factor embedding (JFE) framework is composed of a simple x-vector-like network, it can provide embeddings with more speaker-discriminative information than systems with more complicated architectures

  • In this paper, a novel approach for extracting an embedding vector robust to variability caused by nuisance attributes for speaker verification is proposed


Summary

INTRODUCTION

Speaker verification is the task of verifying the claimed speaker identity based on the given speech samples and has become a key technology for personal authentication in many commercial applications, forensics, and law enforcement [1]. For extracting a channel-robust embedding for speaker verification, Lmain would be the speaker cross-entropy Lspkr defined in (4), and Lsub would be the channel cross-entropy, which can be computed as follows:

Lsub = −∑_{m=1}^{M} rm log rm(ω),

where M is the number of different channels (e.g., recording devices) in the training set, and rm and rm(ω) are the m-th components of the one-hot channel label r and the channel classifier's softmax output r(ω), respectively. Another way to achieve disentanglement is to train the embedding network and the subtask network in a competitive manner via adversarial training [22]. Once the embedding vectors are extracted, both ωspkr and ωnuis are fed into the speaker and nuisance classification networks.
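As a concrete illustration of the channel cross-entropy Lsub described above, here is a minimal NumPy sketch (the function name, variable names, and example values are illustrative assumptions, not from the paper):

```python
import numpy as np

def channel_cross_entropy(r, r_omega, eps=1e-12):
    """Cross-entropy between the one-hot channel label r and the
    channel classifier's softmax output r(omega):
        Lsub = -sum_{m=1}^{M} r_m * log r_m(omega)
    eps guards against log(0). Names here are illustrative only."""
    return float(-np.sum(r * np.log(r_omega + eps)))

# Illustrative example with M = 3 channels; the true channel is index 1.
r = np.array([0.0, 1.0, 0.0])           # one-hot channel label
r_omega = np.array([0.2, 0.7, 0.1])     # channel classifier softmax output
loss = channel_cross_entropy(r, r_omega)  # equals -log(0.7)
```

Because the label is one-hot, only the term for the true channel survives the sum, so the loss reduces to the negative log-probability the classifier assigns to the correct channel.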

TRAINING FOR JOINT FACTOR EMBEDDING
EXPERIMENTAL SETUP
Findings
CONCLUSION
