Abstract

• We propose a variational-based regularization term to train DNN embeddings for speaker verification.
• The regularization function measures the match between the embeddings and a given desired prior distribution.
• The match is non-parametric, which makes the system flexible with respect to the desired prior.
• Both Gaussian and uniform priors produce embeddings better suited to a standard PLDA scoring backend.

In state-of-the-art text-independent speaker verification systems, a discriminative deep neural network (DNN) model learns speaker-discriminative representations (x-vectors) for utterances using labeled data. For the verification task, a Probabilistic Linear Discriminant Analysis (PLDA) model is used to decide whether two x-vectors come from the same speaker. The PLDA scoring model assumes Gaussian priors and conditional distributions across speakers, which is not the case for x-vectors. This work introduces a variational-based regularization term that encourages the network to generate embeddings that follow a desired prior distribution. The regularization function performs a non-parametric match between the embeddings generated in the mini-batch and a sample from the desired distribution. Unlike in Variational Autoencoders (VAEs), no distribution parameters need to be learned and no sampling scheme is required, which makes the proposed method flexible for different desired distributions. Our experiments compared the proposed method with the standard x-vector system and with other recently proposed approaches for generating Gaussianized representations. We assessed the verification performance of the systems using the Fisher English Training Speech - Part II database in eight test conditions based on the gender of the speakers and the duration of the speech segments. Besides using the standard Gaussian distribution as the prior, we also applied a less strict distribution (uniform). The proposed method outperformed the others in all test conditions for both distributions, with performance gains between 7.68% and 20.49% over the standard x-vectors. To understand the effect of the regularization on the embedding space, we also conducted a two-dimensional visual comparison, which showed higher-quality clusters when the regularization was applied.
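For illustration, the sketch below shows one way such a non-parametric prior match could be implemented, assuming a Maximum Mean Discrepancy (MMD) style comparison between the mini-batch embeddings and a same-size sample drawn from the desired prior. The abstract does not name the exact measure used, so the function names (`rbf_kernel`, `mmd_regularizer`), the kernel choice, and the bandwidth `sigma` are illustrative assumptions, not the paper's implementation.

```python
import torch


def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Gaussian (RBF) kernel on pairwise squared Euclidean distances.
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))


def mmd_regularizer(emb: torch.Tensor, prior_sample: torch.Tensor,
                    sigma: float = 1.0) -> torch.Tensor:
    # Biased MMD^2 estimate: small when the embeddings and the prior sample
    # are hard to tell apart, so minimizing it pushes the embedding
    # distribution toward the desired prior. No prior parameters are learned.
    k_ee = rbf_kernel(emb, emb, sigma).mean()
    k_pp = rbf_kernel(prior_sample, prior_sample, sigma).mean()
    k_ep = rbf_kernel(emb, prior_sample, sigma).mean()
    return k_ee + k_pp - 2.0 * k_ep


# Example: a batch of 64 embeddings of dimension 512 (stand-ins for x-vectors
# produced by the network), regularized toward a Gaussian or a uniform prior.
emb = torch.randn(64, 512, requires_grad=True)
gaussian_prior = torch.randn(64, 512)             # N(0, I) sample
uniform_prior = 2.0 * torch.rand(64, 512) - 1.0   # U(-1, 1) sample

reg_loss = mmd_regularizer(emb, gaussian_prior)
reg_loss.backward()  # gradients flow back to the network producing `emb`
```

Because the match is computed directly on samples, switching from the Gaussian to the uniform prior only changes the sampler, which mirrors the flexibility claimed above.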
