Variational Bayesian Joint Factor Analysis Models for Speaker Verification

Xianyu Zhao, Yuan Dong

doi:10.1109/tasl.2011.2170972

Abstract

Joint factor analysis (JFA) is a recently developed method for modeling speaker and session variability in Gaussian Mixture Models (GMMs). In this paper, both batch and sequential Bayesian analyses of JFA models are evaluated for robust speaker recognition. Various sources of uncertainty in JFA models, from latent speaker and channel factors to Gaussian mixture indicator variables, are examined from a Bayesian perspective. By integrating over all these latent factors, we can account for the sources of variability in the speaker enrollment and verification processes better than by considering only point estimates; this study also lets us analyze and identify the contribution of each underlying model uncertainty to the final speaker verification performance. However, because all latent variables in a JFA GMM become correlated with each other given the observed data, Bayesian analysis in closed analytic form is practically intractable. Hence, an alternative approach based on variational Bayes is developed in this paper to explore Bayesian JFA models in an approximate yet efficient way. In this method, the fully correlated posterior distribution is approximated by a variational distribution of factored form to facilitate inference, and a lower bound on the model likelihood is derived to construct detection scores. Experimental results on the 2008 NIST Speaker Recognition Evaluation (NIST SRE) show that these variational Bayesian JFA models obtain significant performance improvements over JFA using point estimates, especially in cases with limited enrollment and test data. For the 10-s task in the 2008 NIST SRE, the variational Bayesian JFA systems obtained relative reductions of 9.4% in EER and 11.5% in DCF compared to the baseline JFA system. This paper also shows the importance of taking into account the uncertainties in both speaker and channel factors, which is more effective than considering uncertainties in channel factors alone.
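
For orientation, the factored approximation and the likelihood lower bound mentioned in the abstract can be sketched in the generic textbook form of variational Bayes. This is an illustrative sketch, not the paper's own derivation: the symbols $O$ (observed feature vectors), $h$ (all latent variables taken together, such as speaker factors, channel factors, and mixture indicators), and the grouping of $h$ into blocks $h_i$ are assumptions introduced here, and the exact factorization used by the authors is not stated in the abstract.

```latex
% Generic mean-field variational Bayes (assumed notation, not taken from the paper).
% O: observations; h = (h_1, ..., h_K): latent variables grouped into K blocks.
\begin{align}
  q(h) &= \prod_{i=1}^{K} q_i(h_i)
    && \text{(factored variational posterior)} \\
  \log p(O) &= \mathcal{L}(q) + \mathrm{KL}\bigl(q(h) \,\|\, p(h \mid O)\bigr)
    \;\ge\; \mathcal{L}(q)
    && \text{(lower bound on the log-likelihood)} \\
  \mathcal{L}(q) &= \mathbb{E}_{q}\bigl[\log p(O, h)\bigr]
    - \mathbb{E}_{q}\bigl[\log q(h)\bigr] \\
  \log q_j^{*}(h_j) &= \mathbb{E}_{i \neq j}\bigl[\log p(O, h)\bigr] + \text{const}
    && \text{(coordinate-wise update for block } j\text{)}
\end{align}
```

Because the KL term is non-negative, $\mathcal{L}(q)$ never exceeds $\log p(O)$, which is what allows a bound of this kind to stand in for the intractable model likelihood when constructing detection scores.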

Full Text