Abstract

In recent years, extensive research has investigated methods for verifying users with a short randomized pass-phrase, driven by the increasing demand for voice-based authentication systems. In this paper, we propose a novel technique for extracting an i-vector-like feature based on an adversarially learned inference (ALI) model, which summarizes the variability within the Gaussian mixture model (GMM) distribution through a nonlinear process. Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the proposed ALI-based model is trained to generate the GMM supervector according to the maximum likelihood criterion given the Baum–Welch statistics of the input utterance. However, to prevent the potential loss of information caused by the Kullback–Leibler (KL) divergence regularization adopted in VAE training, the proposed ALI-based feature extractor exploits a joint discriminator to ensure that the generated latent variable and GMM supervector are more realistic. The proposed framework is compared with the conventional i-vector and VAE-based methods on the TIDIGITS dataset. Experimental results show that the proposed method represents the uncertainty caused by short duration better than the VAE-based method. Furthermore, the proposed approach shows strong performance when applied in combination with the standard i-vector framework.

Highlights

  • Over the last decade, the i-vector framework [1,2] has become one of the most dominant approaches for feature extraction in text-independent speaker recognition

  • In order to ensure that the generated latent variable z matches its prior distribution and that the Gaussian mixture model (GMM) supervector m well preserves the distributive structure of the GMM derived from the universal background model (UBM), the proposed scheme utilizes a joint discriminator for regularizing the encoder and decoder parameters

  • Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the architecture proposed in this paper is composed of an encoder and a decoder network where the former estimates the distribution of the latent variable given the speech and the latter generates the GMM from the latent variable
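The joint-discriminator idea described in the highlights can be sketched as follows. This is a toy illustration, not the paper's implementation: the network sizes, the linear encoder/decoder, and all variable names are assumptions chosen for brevity. In ALI, the discriminator scores joint pairs (z, m), distinguishing the encoder pairing (z inferred from a real supervector m) from the decoder pairing (m generated from a prior sample z), which regularizes both networks at once.

```python
import numpy as np

rng = np.random.default_rng(0)
D_m, D_z = 8, 2  # toy supervector and latent dims; real supervectors are far larger

def encoder(m, W):
    """Toy deterministic encoder: infers a latent z from a supervector m."""
    return m @ W

def decoder(z, V):
    """Toy decoder: generates a supervector m from a latent z."""
    return z @ V

def discriminator(z, m, U):
    """Joint discriminator on (z, m) pairs; outputs P(pair is the encoder pairing)."""
    h = np.concatenate([z, m], axis=-1) @ U
    return 1.0 / (1.0 + np.exp(-h))  # sigmoid

W = rng.normal(size=(D_m, D_z))
V = rng.normal(size=(D_z, D_m))
U = rng.normal(size=(D_m + D_z, 1))

m_real = rng.normal(size=(4, D_m))   # stands in for Baum-Welch-derived supervectors
z_prior = rng.normal(size=(4, D_z))  # samples from the latent prior N(0, I)

p_enc = discriminator(encoder(m_real, W), m_real, U)          # encoder pairing
p_dec = discriminator(z_prior, decoder(z_prior, V), U)        # decoder pairing
# Training (omitted) would push the discriminator to separate these two
# pairings while the encoder and decoder are updated to fool it.
```

The adversarial game over joint pairs is what lets ALI match both the latent distribution and the supervector distribution without an explicit KL penalty.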


Summary

Introduction

The i-vector framework [1,2] has become one of the most dominant approaches for feature extraction in text-independent speaker recognition. The latent variable of the VAE-based feature extraction model in [17] serves as an utterance-level representation and has shown better performance than the conventional i-vector in coping with the uncertainty caused by short duration. As mentioned in [20,21], because the Kullback–Leibler (KL) divergence regularization term in the VAE objective stretches the latent variable space over the entire training set to avoid assigning small probability to any sample, the VAE output often lacks variety. This may lead the VAE-based feature extractor to yield utterance-level features lacking distinct speaker-dependent information, limiting the performance of the overall speaker recognition system.
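The KL regularization term discussed above has a closed form when the VAE posterior is a diagonal Gaussian N(mu, sigma^2) and the prior is N(0, I): KL = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2). The sketch below (function name and toy values are illustrative, not from the paper) shows how this term is zero only when the posterior equals the prior and grows as posteriors move apart, which is the pressure that can wash out speaker-dependent structure.

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Posterior identical to the prior: no penalty.
kl_at_prior = kl_to_std_normal(np.zeros(3), np.zeros(3))

# Posterior mean shifted away from the origin: the penalty grows,
# pulling utterance-level latents back toward a common region.
kl_shifted = kl_to_std_normal(np.ones(3), np.zeros(3))
```

Because every utterance's posterior pays this penalty, distinct speakers' latents are all drawn toward the same prior, which is the information-loss effect the ALI-based extractor is designed to avoid.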

I-Vector Framework
Adversarially Learned Inference
Adversarially Learned Feature Extraction
Maximum Likelihood Criterion
Adversarially Learned Inference for Nonlinear I-Vector Extraction
Relationship to the VAE-Based Feature Extractor
Databases
Experimental Setup
Effect of the Duration on the Latent Variable
Findings
Conclusions