I-vector Model Research Articles

Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. I-vector and probabilistic linear discriminant analysis (PLDA) based systems have become the standard in speaker verification applications, but they are less effective with short utterances. In this paper, we first compare two state-of-the-art universal background model (UBM) training methods for i-vector modeling using full-length and short-utterance evaluation tasks. The two methods are Gaussian mixture model (GMM) based (denoted I-vector_GMM) and deep neural network (DNN) based (denoted I-vector_DNN). The results indicate that the I-vector_DNN system outperforms the I-vector_GMM system across all tested durations (from full length down to 5 s). However, the performance of both systems degrades significantly as the duration of the utterances decreases. To address this issue, we propose two novel nonlinear mapping methods which train DNN models to map the i-vectors extracted from short utterances to their corresponding long-utterance i-vectors. The mapped i-vector can restore missing information and reduce the variance of the original short-utterance i-vectors. Both proposed methods model the joint representation of short- and long-utterance i-vectors: the first method first trains an autoencoder on concatenated short- and long-utterance i-vectors and then uses the pre-trained weights to initialize a supervised regression model from the short to the long version; the second method jointly trains the supervised regression model with an autoencoder that reconstructs the short-utterance i-vector itself. Experimental results using the NIST SRE 2010 dataset show that both methods provide significant improvement, yielding a 24.51% relative improvement in equal error rate (EER) over a baseline system.
To learn a better joint representation, we further investigate the effect of a deep encoder with residual blocks, and the results indicate that the residual network can further improve the EER over a baseline system by up to 26.47%. Moreover, to improve the mapping of the short i-vector to its long version, an additional vector, which represents the average value of phoneme posteriors across frames, is added to the input, resulting in a 28.43% improvement. When the best-validated SRE10 models are further tested on the Speakers in the Wild (SITW) dataset, the methods yield a 23.12% improvement on arbitrary-duration (1–5 s) short-utterance conditions.
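The second mapping method described above (a regression from short to long i-vectors trained jointly with an autoencoder that reconstructs the short i-vector) can be illustrated with a minimal NumPy sketch of the joint objective. Everything concrete here is an assumption made for illustration and is not taken from the paper: the single shared `tanh` encoder layer, the two linear heads, the mean-squared-error losses, the weighting parameter `alpha`, and the names `init_params`, `forward`, and `joint_loss`.

```python
import numpy as np

def init_params(dim, hidden, rng):
    """Randomly initialise a shared encoder and two linear heads (hypothetical layout)."""
    scale = 0.1
    return {
        "W_enc": rng.normal(0.0, scale, (dim, hidden)),
        "b_enc": np.zeros(hidden),
        "W_long": rng.normal(0.0, scale, (hidden, dim)),  # regression head: short -> long
        "b_long": np.zeros(dim),
        "W_rec": rng.normal(0.0, scale, (hidden, dim)),   # autoencoder head: short -> short
        "b_rec": np.zeros(dim),
    }

def forward(params, short_iv):
    """Encode a batch of short-utterance i-vectors and apply both heads."""
    h = np.tanh(short_iv @ params["W_enc"] + params["b_enc"])
    long_pred = h @ params["W_long"] + params["b_long"]
    short_rec = h @ params["W_rec"] + params["b_rec"]
    return long_pred, short_rec

def joint_loss(params, short_iv, long_iv, alpha=0.5):
    """Weighted sum of the regression loss and the reconstruction loss.

    alpha is an assumed trade-off weight: 0 gives pure regression,
    1 gives a pure autoencoder on the short i-vector.
    """
    long_pred, short_rec = forward(params, short_iv)
    reg = np.mean((long_pred - long_iv) ** 2)    # short -> long regression error
    rec = np.mean((short_rec - short_iv) ** 2)   # short -> short reconstruction error
    return (1.0 - alpha) * reg + alpha * rec, reg, rec

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(dim=16, hidden=8, rng=rng)
    short_batch = rng.normal(size=(4, 16))  # synthetic stand-ins for i-vectors
    long_batch = rng.normal(size=(4, 16))
    total, reg, rec = joint_loss(params, short_batch, long_batch, alpha=0.3)
    print(total, reg, rec)
```

The sketch only evaluates the objective; in practice the parameters would be trained with a gradient-based optimizer, and the mapped i-vector would be the output of the regression head.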

This paper presents a simplified and supervised i-vector modeling approach with applications to robust and efficient language identification and speaker verification. First, by concatenating the label vector and the linear regression matrix at the end of the mean supervector and the i-vector factor loading matrix, respectively, the traditional i-vectors are extended to label-regularized supervised i-vectors. These supervised i-vectors are optimized not only to reconstruct the mean supervectors well but also to minimize the mean square error between the original and the reconstructed label vectors, making the supervised i-vectors more discriminative in terms of the label information. Second, factor analysis (FA) is performed on the pre-normalized centered GMM first-order statistics supervector to ensure that each Gaussian component's statistics sub-vector is treated equally in the FA, which reduces the computational cost by a factor of 25 in the simplified i-vector framework. Third, since the entire matrix inversion term in the simplified i-vector extraction depends on only a single variable (the total frame number), we build a global table of the resulting matrices against the log values of the frame numbers. Using this lookup table, each utterance's simplified i-vector extraction is further sped up by a factor of 4 and suffers only a small quantization error. Finally, the simplified version of the supervised i-vector modeling is proposed to enhance both robustness and efficiency. The proposed methods are evaluated on the DARPA RATS dev2 task, the NIST LRE 2007 general task and the NIST SRE 2010 female condition 5 task for noisy channel language identification, clean channel language identification and clean channel speaker verification, respectively.
For language identification on the DARPA RATS task, the simplified supervised i-vector modeling achieved 2%, 16%, and 7% relative equal error rate (EER) reductions on three different feature sets and a speedup of more than 100 times over the baseline i-vector method for the 120 s task. Similar results were observed on the NIST LRE 2007 30 s task, with a 7% relative average cost reduction. Results also show that the use of Gammatone frequency cepstral coefficients, Mel-frequency cepstral coefficients and spectro-temporal Gabor features in conjunction with shifted-delta-cepstral features improves the overall language identification performance significantly. For speaker verification, the proposed supervised i-vector approach outperforms the i-vector baseline by 12% and 7% relative in terms of EER and norm old minDCF values, respectively.
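The lookup-table idea in the abstract above (precomputing the matrix inverse on a grid of log frame counts, because the inverse depends only on the total frame number) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the inverted matrix is assumed to have the form I + n TᵀT for a loading matrix `T` and frame count `n`, the nearest-neighbour lookup is one possible quantization scheme, and the function names `build_inverse_table` and `lookup_inverse` are hypothetical.

```python
import numpy as np

def build_inverse_table(T, n_min, n_max, num_bins):
    """Precompute (I + n * T^T T)^-1 on a grid of frame counts evenly spaced in log(n).

    The form of the matrix is an assumption for illustration only.
    """
    rank = T.shape[1]
    TtT = T.T @ T
    log_grid = np.linspace(np.log(n_min), np.log(n_max), num_bins)
    table = np.stack(
        [np.linalg.inv(np.eye(rank) + np.exp(g) * TtT) for g in log_grid]
    )
    return log_grid, table

def lookup_inverse(log_grid, table, n_frames):
    """Return the precomputed inverse whose log frame count is nearest to log(n_frames)."""
    idx = int(np.argmin(np.abs(log_grid - np.log(n_frames))))
    return table[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 0.1 * rng.normal(size=(20, 4))  # synthetic loading matrix
    log_grid, table = build_inverse_table(T, n_min=10, n_max=10000, num_bins=200)

    n = 537  # frame count of a hypothetical utterance, not on the grid
    exact = np.linalg.inv(np.eye(4) + n * (T.T @ T))
    approx = lookup_inverse(log_grid, table, n)
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print("relative quantization error:", rel_err)
```

Spacing the grid evenly in log(n) keeps the relative error in `n` roughly constant across short and long utterances, which is one plausible reason for indexing the table by log values; a finer grid trades memory for a smaller quantization error.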

Articles published on I-vector Model

Empirical Comparison between Deep and Classical Classifiers for Speaker Verification in Emotional Talking Environments

The Effect of Speech Fragmentation and Audio Encodings on Automatic Parkinson’s Disease Recognition

Novel feature representation using single frequency filtering and nonlinear energy operator for speech emotion recognition

Speaker Recognition: Progression and challenges

Speaker forensic identification using joint factor analysis and i-vector

Estimating Uniqueness of I-Vector-Based Representation of Human Voice

ELM speaker identification for limited dataset using multitaper based MFCC and PNCC features with fusion score

I-vector Evaluation of Electrocardiogram (ECG) Biometric Identification System based on Sequential Compensation Approach

Supervised I-vector modeling for language and accent recognition

Deep neural network based i-vector mapping for speaker verification using short utterances

Emotional speaker recognition in real life conditions using multiple descriptors and i-vector speaker modeling technique

I-Vector-Based Speaker Verification on Limited Data Using Fusion Techniques

Client-wise cohort set selection by combining speaker- and phoneme-specific I-vectors for speaker verification

Template-matching for text-dependent speaker verification

Study of Senone-Based Deep Neural Network Approaches for Spoken Language Recognition

Rapid Language Identification

A fast and scalable hybrid FA/PPCA-based framework for speaker recognition

Simplified supervised i-vector modeling with application to robust and efficient language identification and speaker verification
