Abstract

This paper presents methods for supervised and unsupervised speaker adaptation of Gaussian mixture speaker models in text-independent speaker verification. The methods build on an approach that separates speaker variability from channel variability, so that speaker models can be updated progressively while minimizing the influence of the channel variability associated with the adaptation utterances. The approach relies on a joint factor analysis model of intrinsic speaker variability and session variability, in which inter-session variation is assumed to result primarily from channel effects [1]. The adaptation methods were evaluated under the adaptation paradigm defined in the NIST 2005 speaker recognition evaluation plan, which is based on conversational telephone speech [2]. When both target speaker model training and speaker verification trials used a five-minute excerpt from a single conversation, unsupervised speaker adaptation during evaluation yielded an equal error rate (EER) of 4.5% and a minimum detection cost function (DCF) of 0.013. This performance is shown to be comparable to that of state-of-the-art speaker verification systems that rely on a larger set of features and are trained on as many as eight conversations from the target speaker.
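For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an EER and a minimum DCF can be computed from verification scores. It is not the authors' evaluation code. The function names are hypothetical, and the default cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are the values commonly associated with the NIST speaker recognition evaluations of that period; they are an assumption here, since the abstract does not state them or say whether the reported DCF is normalized.

```python
import numpy as np


def error_rates(target_scores, impostor_scores):
    """Sweep a decision threshold over every observed score.

    A trial is accepted when its score is >= the threshold.
    Returns (thresholds, p_miss, p_fa) as NumPy arrays.
    """
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    # Include +inf so the "reject everything" operating point is covered.
    thresholds = np.unique(np.concatenate([tgt, imp, [np.inf]]))
    p_miss = np.array([(tgt < t).mean() for t in thresholds])
    p_fa = np.array([(imp >= t).mean() for t in thresholds])
    return thresholds, p_miss, p_fa


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER as the point where miss and false-alarm rates are closest."""
    _, p_miss, p_fa = error_rates(target_scores, impostor_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[i] + p_fa[i]) / 2.0


def min_dcf(target_scores, impostor_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum (unnormalized) detection cost over all thresholds.

    The default cost parameters are assumed, not taken from the abstract.
    """
    _, p_miss, p_fa = error_rates(target_scores, impostor_scores)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return dcf.min()


if __name__ == "__main__":
    # Toy scores for illustration only; not data from the paper.
    targets = [2.0, 3.0, 4.0, 0.5]
    impostors = [0.0, 1.0, 0.2, 2.5]
    print("EER:", equal_error_rate(targets, impostors))
    print("min DCF:", min_dcf(targets, impostors))
```

With a low target prior and a high miss cost, the minimum DCF is dominated by the false-alarm term, which is why it can rank systems differently from the EER.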
