Abstract

In this paper, we introduce the concept of the effective sample size to speaker diarization and recognition. We show why the nominal sample size is inadequate for feature streams that exhibit inter-frame correlations and how its use adversely affects inference. We then discuss the effective sample size, that is, the size of a set of independent observations that carries an equivalent amount of statistical information about the model parameters, and how the corresponding scaling factor can be estimated. Our experiments on speaker diarization show that, once the effective sample size is adopted, state-of-the-art results can be attained even with single Gaussians and hierarchical clustering, and even when a common scaling factor is used for all utterances. For speaker recognition, we report encouraging results on NIST-2010 using iVectors and PLDA.
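
To make the idea of an effective sample size concrete, the following is a minimal sketch of one textbook estimator, n_eff = n / (1 + 2 * sum of positive-lag autocorrelations), applied to a scalar sequence. It is not necessarily the estimator used in the paper: the function names, the AR(1) toy data standing in for a correlated feature stream, the `max_lag` cutoff, and the rule of truncating the sum at the first non-positive autocorrelation are all assumptions made for illustration.

```python
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0.0:
        return 0.0  # constant sequence: treat as uncorrelated
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var


def effective_sample_size(x, max_lag=50):
    """Estimate n_eff = n / (1 + 2 * sum_k rho_k).

    Assumption for this sketch: the sum is truncated at the first
    non-positive autocorrelation, or at max_lag, whichever comes first.
    """
    n = len(x)
    rho_sum = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        rho = autocorrelation(x, lag)
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def ar1(n, phi, seed=0):
    """Toy AR(1) sequence x_t = phi * x_{t-1} + noise (hypothetical data)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


correlated = ar1(5000, phi=0.9)   # strong frame-to-frame correlation
independent = ar1(5000, phi=0.0)  # no correlation between frames

ess_corr = effective_sample_size(correlated)
ess_ind = effective_sample_size(independent)

# The ratio n_eff / n acts as a scaling factor below 1 for correlated data.
print("correlated:  n = 5000, n_eff = %.0f" % ess_corr)
print("independent: n = 5000, n_eff = %.0f" % ess_ind)
```

In this toy setting the strongly correlated sequence yields an effective sample size far below its nominal length, while the independent sequence stays close to it. That gap is the kind of mismatch the abstract argues distorts inference when the nominal sample size is used.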
