Abstract

In this paper, we propose the use of cluster adaptive training (CAT) weights as features for support vector machine (SVM) based text-independent speaker verification. Each speaker utterance is characterized by a vector of cluster weights, which are estimated during the cluster adaptive training process. We investigate how verification performance depends on the number of clusters and on the number of classes, which are obtained by partitioning the components of the model. To remove session variability caused by differences in microphone, environment, and similar factors, Nuisance Attribute Projection (NAP) is also evaluated. Experimental results on a NIST SRE 2006 task show that the CAT-weight SVM system achieves performance comparable to a state-of-the-art cepstral GMM-UBM verification system, and that fusing the two systems gives further gains.

Index Terms: CAT, SVM, NAP, GMM-UBM, fusion

1. Introduction

For text-independent speaker verification, the most prevalent framework is the Gaussian Mixture Model – Universal Background Model (GMM-UBM) framework [1], in which the speaker model is constructed by Maximum a Posteriori (MAP) adaptation of the means of the UBM. In recent years, many alternative speaker modeling methods have been proposed. Among these, adaptation methods based on reference clusters or reference speakers (e.g. Cluster Adaptive Training [2], Eigenvoice modeling [3], Reference Speaker Weighting [4], Anchor modeling [5], etc.) have been studied extensively in both speech recognition [2, 3, 4] and speaker recognition [5, 6]. In the reference-cluster approach, a model is built for each cluster, and a new speaker model is then constructed as a linear interpolation of the cluster parameters. The aim is to map the enrolled speaker into a new space spanned by the reference clusters, which may offer different discriminative capability.
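The linear interpolation described above can be illustrated with a minimal sketch. It assumes a single Gaussian component whose speaker-specific mean is a weighted sum of per-cluster means; the function name `interpolate_mean` and the toy numbers are hypothetical and are not taken from the paper, and the estimation of the weights themselves is not shown.

```python
def interpolate_mean(cluster_means, weights):
    """Return the weighted sum of the cluster means for one Gaussian component.

    cluster_means: list of C mean vectors (one per reference cluster).
    weights: list of C interpolation weights for one speaker (assumed given).
    """
    if len(cluster_means) != len(weights):
        raise ValueError("need exactly one weight per cluster")
    dim = len(cluster_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(weights, cluster_means))
        for d in range(dim)
    ]


# Toy example: three reference clusters, two-dimensional means.
cluster_means = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
weights = [0.5, 0.25, 0.25]  # hypothetical weights for one speaker

speaker_mean = interpolate_mean(cluster_means, weights)
print(speaker_mean)  # [1.5, 1.0]
```

In this picture, the weight vector is the compact description of the speaker, and it is this vector that the paper proposes to use as the SVM feature.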
Support vector machines (SVMs) have become one of the most popular classification techniques for speaker recognition, e.g. [8]–[11]. SVMs operate in a high-dimensional feature space that is derived by a nonlinear mapping of the input space. To address the performance degradation introduced by session variability (e.g. microphone, environment, etc.), Nuisance Attribute Projection (NAP) was developed in [10] to remove dimensions from the SVM expansion space that are irrelevant to the classification problem.

In this study, we investigate the use of CAT weights as features in SVM-based speaker verification. In this method, the characteristics of a speaker are modeled by the CAT weights, which are stacked into a vector in the space spanned by a set of pre-selected reference clusters. NAP is then applied to reduce session variability in the cluster weight vectors. Fusion of this new system with a conventional GMM-UBM system is also investigated.

The remainder of this paper is organized as follows. In Section 2, we review cluster adaptive training. In Section 3, we present the use of CAT weight vectors in SVM-based speaker verification. In Section 4, we report experimental results on a NIST speaker recognition evaluation (SRE) 2006 task. Conclusions are given at the end of the paper.
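As a rough illustration of the NAP step mentioned in the introduction, the sketch below removes a set of nuisance directions from a weight vector by applying the projection I - U Uᵀ. It assumes the nuisance directions are already available as orthonormal vectors; how they are estimated from session data is not covered here, and the names `normalize` and `nap_project` and the toy values are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nap_project(vector, nuisance_basis):
    """Apply (I - U U^T) to `vector`.

    nuisance_basis: list of orthonormal nuisance directions (the columns of U),
    assumed to be given.
    """
    result = list(vector)
    for direction in nuisance_basis:
        # Component of the original vector along this nuisance direction.
        coeff = sum(u * x for u, x in zip(direction, vector))
        result = [r - coeff * u for r, u in zip(result, direction)]
    return result


# Toy example: one nuisance direction in a three-dimensional weight space.
nuisance_basis = [normalize([1.0, 1.0, 0.0])]
cat_weights = [0.9, 0.1, 0.4]  # hypothetical CAT weight vector

cleaned = nap_project(cat_weights, nuisance_basis)

# After projection, the vector has (numerically) no component left
# along the nuisance direction.
residual = sum(u * c for u, c in zip(nuisance_basis[0], cleaned))
print(cleaned, residual)
```

The projected vectors, rather than the raw weights, would then be passed to the SVM, so that the removed directions no longer influence the kernel.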
