Abstract

In recent years, the i-vector based framework has been shown to provide state-of-the-art performance in speaker verification. Each utterance is projected onto a total factor space and represented by a low-dimensional i-vector. However, performance degradation in the i-vector space remains problematic and is commonly attributed to channel variability. Techniques such as linear discriminant analysis (LDA) and probabilistic linear discriminant analysis (PLDA) aim to compensate for the variability caused by channel effects. In real-world applications, however, the enrollment and test utterances available for each user (speaker) are often very short. In this paper, we demonstrate, from both analytical and experimental perspectives, that feature sparsity and imbalance are widespread in short utterances, in which case the conventional i-vector extraction algorithm, based on maximum likelihood estimation (MLE), may over-fit and degrade the performance of the speaker verification system. This prompted us to propose an improved i-vector extraction algorithm, which we term adaptive first-order Baum–Welch statistics analysis (AFSA). The new algorithm suppresses and compensates for the deviation in the first-order Baum–Welch statistics caused by feature sparsity and imbalance. We report results on the male telephone portion of the core trial condition (short2-short3) and on other short-duration trial conditions (short2-10sec and 10sec-10sec) of the NIST 2008 Speaker Recognition Evaluation (SRE) dataset. As measured by both the equal error rate (EER) and the minimum value of the NIST detection cost function (minDCF), a 10%–15% relative improvement is obtained over the traditional i-vector baseline system.
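The abstract does not detail AFSA itself, but the conventional pipeline it builds on can be sketched. The following Python/NumPy sketch (all names, such as baum_welch_stats, extract_ivector, and T_matrix, are illustrative assumptions, not taken from the paper) accumulates the zeroth- and first-order Baum–Welch statistics of an utterance against a diagonal-covariance universal background model (UBM) and then computes the standard MAP point estimate of the i-vector. It is the first-order statistics F, which become sparse and imbalanced for short utterances, that AFSA adjusts.

    import numpy as np

    def baum_welch_stats(features, ubm_means, ubm_covs, ubm_weights):
        # features: T x D frames; UBM has C diagonal-covariance components
        # ubm_means, ubm_covs: C x D; ubm_weights: C
        diff = features[:, None, :] - ubm_means[None, :, :]            # T x C x D
        log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_covs, axis=2)
                            + np.sum(np.log(2.0 * np.pi * ubm_covs), axis=1))
        log_post = np.log(ubm_weights) + log_gauss                     # T x C
        log_post -= log_post.max(axis=1, keepdims=True)                # stabilize softmax
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)                        # frame posteriors
        N = post.sum(axis=0)                                           # zeroth-order stats, C
        F = post.T @ features - N[:, None] * ubm_means                 # centered first-order stats, C x D
        return N, F

    def extract_ivector(N, F, T_matrix, ubm_covs):
        # MAP point estimate: w = (I + T' S^-1 N T)^-1 T' S^-1 f,
        # with T_matrix: CD x R, S the UBM covariance supervector, f = vec(F)
        C, D = ubm_covs.shape
        R = T_matrix.shape[1]                                          # i-vector dimension
        sigma_inv = (1.0 / ubm_covs).reshape(-1)                       # CD diagonal precision
        TtSinv = T_matrix.T * sigma_inv                                # R x CD
        precision = np.eye(R) + (TtSinv * np.repeat(N, D)) @ T_matrix  # R x R
        return np.linalg.solve(precision, TtSinv @ F.reshape(-1))

    # Toy usage: 2-component UBM, 3-dim features, 5-dim i-vector
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((100, 3))
    means, covs, weights = rng.standard_normal((2, 3)), np.ones((2, 3)), np.array([0.5, 0.5])
    T = 0.1 * rng.standard_normal((6, 5))
    N, F = baum_welch_stats(feats, means, covs, weights)
    w = extract_ivector(N, F, T, covs)

In this formulation, the fewer frames an utterance contributes to a component (small N_c), the noisier its row of F becomes; the paper's claim is that compensating this deviation at the statistics level, before i-vector extraction, is what recovers performance on short utterances.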
