Abstract

Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short, randomized pass-phrases drawn from a constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered an optimal method for this task since it is known to suffer from severe performance degradation when dealing with short-duration speech utterances. More recent approaches that implement deep-learning techniques for embedding the speaker variability in a non-linear fashion have shown impressive performance in various speaker verification tasks. However, since most of these techniques are trained in a supervised manner, which requires speaker labels for the training data, they are difficult to use when only a limited amount of labeled data is available for training. In this paper, we propose a novel technique for extracting an i-vector-like feature based on the variational autoencoder (VAE), which is trained in an unsupervised manner to obtain a latent variable representing the variability within a Gaussian mixture model (GMM) distribution. The proposed framework is compared with the conventional i-vector method on the TIDIGITS dataset. Experimental results showed that the proposed method could cope with the performance deterioration caused by short utterance duration. Furthermore, the performance of the proposed approach improved significantly when applied in conjunction with the conventional i-vector framework.
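To illustrate the core mechanism described above, the encoder of a VAE maps an utterance's statistics to the parameters of a Gaussian posterior over a latent variable z, sampled during training via the reparameterization trick; at extraction time the posterior mean can serve as the i-vector-like feature. The sketch below is illustrative only: the linear "encoder", the dimensions, and all variable names are assumptions, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(stats, W_mu, W_logvar):
    """Toy linear 'encoder': maps utterance-level statistics to the
    parameters of a Gaussian posterior q(z | stats). In the paper's
    framework the encoder is a neural network; this linear map is a
    stand-in for illustration only."""
    mu = stats @ W_mu          # posterior mean
    logvar = stats @ W_logvar  # posterior log-variance
    return mu, logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (the reparameterization trick),
    which keeps the sampling step differentiable during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Illustrative dimensions: 100-dim statistics -> 20-dim latent variable.
stat_dim, z_dim = 100, 20
W_mu = rng.standard_normal((stat_dim, z_dim)) * 0.01
W_logvar = rng.standard_normal((stat_dim, z_dim)) * 0.01

stats = rng.standard_normal(stat_dim)   # statistics for one utterance
mu, logvar = encode(stats, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)

# At extraction time, the posterior mean mu is used as the
# i-vector-like utterance-level feature (no sampling needed).
print(mu.shape)  # (20,)
```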

Highlights

  • Speaker verification is the task of verifying the claimed speaker identity in the input speech. The speaker verification process is composed of three steps: acoustic feature extraction, utterance-level feature extraction, and scoring

  • We present the experimental results obtained from a speaker recognition task where the decision was made by fusing the probabilistic linear discriminant analysis (PLDA) scores of i-vector features and variational autoencoder (VAE)-based features

  • In order to capture the variability that has not been fully represented by the linear projection in the traditional i-vector framework, we designed a VAE for the Gaussian mixture model (GMM) distribution


Summary

Introduction

Speaker verification is the task of verifying the claimed speaker identity in the input speech. Since most classification or verification algorithms operate on fixed-dimensional vectors, acoustic features cannot be directly used with such methods due to the variable duration of speech [1]. The random digit string task highlights one of the most serious causes of feature uncertainty, which is the short duration of the given speech samples [15]. In contrast to conventional deep learning-based feature extraction techniques, which take the acoustic features as input, the proposed approach exploits the resources used in the conventional i-vector scheme. Interestingly, a dramatic performance improvement was observed when the features extracted from the proposed method and the conventional i-vector were augmented together. This indicates that the newly proposed feature and the conventional i-vector are complementary to each other
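Augmenting the two features together, as described above, amounts to feature-level fusion: each embedding is normalized and the two are concatenated before being passed to the back-end. The sketch below assumes illustrative dimensions (a 400-dim i-vector and a 200-dim VAE latent variable) and uses length normalization, a standard pre-processing step before PLDA scoring; it is not the paper's exact recipe.

```python
import numpy as np

def length_normalize(v):
    """Project a vector onto the unit sphere. Length normalization is a
    common pre-processing step before PLDA scoring."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse_features(ivector, vae_latent):
    """Feature-level fusion: normalize each embedding separately and
    concatenate. The combined vector would then be fed to a PLDA
    back-end. Dimensions here are illustrative assumptions."""
    return np.concatenate([length_normalize(ivector),
                           length_normalize(vae_latent)])

ivec = np.ones(400)   # assumed 400-dim i-vector
z = np.ones(200)      # assumed 200-dim VAE latent variable
fused = fuse_features(ivec, z)
print(fused.shape)  # (600,)
```

Normalizing each embedding before concatenation keeps one feature from dominating the fused vector purely because of its scale.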

I-Vector Framework
Variational Autoencoder
Variational Inference Model for Non-Linear Total Variability Embedding
Maximum Likelihood Training
Non-Linear Feature Extraction and Speaker Verification
Databases
Experimental Setup
Effect of the Duration on the Latent Variable
Experiments with VAEs
Feature-Level Fusion of I-Vector and Latent Variable
Score-Level Fusion of I-Vector and Latent Variable
Findings
Conclusions