Abstract

Gaussian mixture model (GMM) based approaches are commonly used for speaker recognition tasks. Methods for estimating the parameters of GMMs include the expectation-maximization method, which is a non-discriminative learning method, and the large margin method, which is a discriminative learning method. Discriminative classifier based approaches to speaker recognition include support vector machine (SVM) classifiers using dynamic kernels such as the generalized linear discriminant sequence kernel, the probabilistic sequence kernel, the GMM supervector kernel and the Bhattacharyya distance based kernel. Recently, the intermediate matching kernel (IMK) has been proposed as a dynamic kernel for recognition of objects in an image represented using a set of local feature vectors. The IMK-based SVMs give a better performance than the state-of-the-art GMM-based approaches for speaker identification tasks, because they are well suited to the basic challenge of providing reliable scores of the intra-speaker variation of suspects and of the inter-speaker variation of the potential population, which is crucial to law enforcement and counter-terrorism agencies in evaluating the strength of the evidence at hand. Thus, the IMK-based SVMs can be used to build the speaker recognition models in forensic speaker recognition (FSR) systems. However, it is necessary to develop techniques to determine the strength of evidence from the outputs of SVM-based models. The SVM-based models are trained using discriminative methods and their generalization ability is good. We propose to use the IMK-based SVM classifier for speaker identification from the speech signal of an utterance represented as a set of local feature vectors. The main issue in building the IMK-based SVM classifier is the selection of the virtual feature vectors, which are used to match the local feature vectors from the representations of two different utterances.
We explore the use of the components of the universal background GMM as the set of virtual feature vectors. We compare the performance of the GMM-based approaches and the dynamic kernel SVM-based approaches to speaker identification. The 2002 and 2003 NIST speaker recognition corpora are used to evaluate the different approaches to speaker identification. Results of our studies show that the dynamic kernel SVM-based approaches give a significantly better performance than the GMM-based approaches. For the speaker identification task, the IMK-based SVM gives a performance that is comparable to that of SVMs using any of the other dynamic kernels. The storage requirements and the computational complexity of the IMK-based SVMs are lower than those of SVMs using any of the other dynamic kernels.
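
The matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give the kernel formula, so several choices here are assumptions: each virtual feature vector is taken to be a single point (for example, the mean of one universal background GMM component), the local feature vector matched to it is the nearest one in Euclidean distance, and the base kernel is a Gaussian with a hypothetical width parameter `gamma`. The function name `intermediate_matching_kernel` and the toy data are likewise illustrative only.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def closest(vectors, anchor):
    """Return the vector in `vectors` that lies nearest to `anchor`."""
    return min(vectors, key=lambda v: squared_distance(v, anchor))


def intermediate_matching_kernel(X, Y, virtual_vectors, gamma=0.5):
    """Sum of base-kernel values over pairs matched via virtual vectors.

    X, Y            : sets of local feature vectors (one set per utterance)
    virtual_vectors : assumed here to be points such as UBM component means
    gamma           : assumed width of the Gaussian base kernel
    """
    total = 0.0
    for v in virtual_vectors:
        x_star = closest(X, v)  # local vector of X selected by v
        y_star = closest(Y, v)  # local vector of Y selected by v
        total += math.exp(-gamma * squared_distance(x_star, y_star))
    return total


# Toy data: three virtual vectors and two utterances of different lengths.
virtual_vectors = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
utterance_a = [[0.1, -0.1], [0.9, 1.2], [2.1, 0.1], [1.5, 0.5]]
utterance_b = [[0.0, 0.2], [1.1, 0.8]]

k_ab = intermediate_matching_kernel(utterance_a, utterance_b, virtual_vectors)
```

In this sketch the two utterances may contain different numbers of local feature vectors, and the number of base-kernel evaluations equals the number of virtual feature vectors rather than the number of pairs of local feature vectors. A kernel value computed this way for every pair of utterances could then be supplied to an SVM trainer that accepts precomputed kernel matrices.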
