The past decade has witnessed a significant improvement in the performance of speaker recognition (SR) technology with the introduction of the i-vector framework. Despite these advances, the performance of SR systems suffers considerably in the presence of acoustic nuisances and variabilities. In this paper, we develop a data-driven nuisance compensation technique in the i-vector space without referring to the effects of the targeted nuisances in the temporal domain. This approach is nonparametric in that it does not assume a specific relationship between a "good" version of an i-vector and its corrupted version. Instead, our algorithm directly models the joint distribution of both representations (the good i-vector and its corrupted counterpart) and takes advantage of the reproducibility of acoustic corruptions to generate the corrupted i-vectors. We then build an MMSE estimator that computes an improved version of a corrupted test i-vector given this joint distribution. Experiments are carried out on the NIST SRE 2010 and Speakers in the Wild databases, where the proposed algorithm is used to deal with additive noise and short utterances. Our technique is shown to be efficient, improving the baseline system performance in terms of equal-error rate by up to 70% on known test noises and up to 65% on unseen noises using a generic model. It also proves effective in the context of duration mismatch, reaching up to 40% relative improvement on short utterances using multiple models corresponding to different durations, and up to 36% on test segments of arbitrary duration.
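As an illustration of the core idea, the sketch below implements the MMSE estimator for the simplest special case in which the joint distribution of clean and corrupted i-vectors is modeled as a single Gaussian; under that assumption, the MMSE estimate is the conditional mean. The paper's actual model is nonparametric (e.g., a mixture rather than one Gaussian), so all function names and the single-Gaussian choice here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_joint_gaussian(clean, corrupted):
    """Fit a joint Gaussian over paired (clean, corrupted) i-vectors.

    clean, corrupted: arrays of shape (n_samples, d), paired row-by-row
    (the same utterance, with and without the corruption).
    Returns the stacked mean (2d,) and covariance (2d, 2d).
    """
    z = np.hstack([clean, corrupted])          # each row: [x, y]
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)
    return mu, cov

def mmse_estimate(y, mu, cov, d):
    """MMSE estimate of the clean i-vector x given a corrupted observation y.

    Under a joint Gaussian, E[x | y] = mu_x + S_xy S_yy^{-1} (y - mu_y).
    """
    mu_x, mu_y = mu[:d], mu[d:]
    S_xy = cov[:d, d:]
    S_yy = cov[d:, d:]
    return mu_x + S_xy @ np.linalg.solve(S_yy, y - mu_y)
```

In this setting, "reproducibility of acoustic corruptions" means the paired training set can be generated by re-corrupting clean utterances (e.g., adding the target noise) and re-extracting their i-vectors; per-noise or per-duration models then amount to fitting one such joint distribution per condition.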