Text Prompted Speaker Verification Based on Phoneme Clustering with Earth Mover's Distance and Cauchy-Schwarz Divergence

Zhuzi Chen, Jia Liu, Yi Liu

doi:10.1145/3242840.3242873

Abstract

In short-duration text-prompted speaker verification, the amount of enrollment data available for each speaker model is limited, so it is hard to obtain a robust speaker representation. In this short-utterance setting, i-vector/GMM approaches perform even worse than the traditional GMM-MAP modeling method. Content matching within the GMM/HMM framework is one of the state-of-the-art paradigms for short-duration text-dependent speaker verification: models for individual lexical units such as words, syllables, or phonemes are built for both the background and the speaker to compensate for mismatch. However, some phonemes do not occur in the enrollment data but do appear in the test recordings, and most phonemes have different preceding and succeeding phonemes in enrollment and test, which leads to coarticulation differences. These two problems are called lexical mismatch and context mismatch. In this work, to overcome the lexical and context mismatch caused by data sparseness, phoneme states are clustered using Earth Mover's Distance and Cauchy-Schwarz divergence as metrics. With the Earth Mover's Distance metric, EER was lowered by 6.2% and minDCF08 by 1.9%; with the Cauchy-Schwarz divergence metric, EER was lowered by 3.7% while minDCF08 rose by 1.9%.
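
As an illustration of one of the two metrics named above, the following is a minimal sketch of the Cauchy-Schwarz divergence between two phoneme-state distributions. It assumes, purely for illustration, that each state is summarized by a single Gaussian with diagonal covariance; the abstract does not specify the state model, and the function names here are hypothetical rather than taken from the paper. Under that assumption the divergence has a closed form, because the integral of a product of two Gaussians is itself a Gaussian density evaluated at the difference of the means.

```python
import numpy as np

def log_gauss_overlap(mu1, var1, mu2, var2):
    """Log of the integral over x of N(x; mu1, var1) * N(x; mu2, var2).

    Assumes diagonal covariances given as 1-D variance vectors.
    The integral equals N(mu1; mu2, var1 + var2).
    """
    s = var1 + var2
    d = mu1 - mu2
    return -0.5 * np.sum(np.log(2.0 * np.pi * s) + d * d / s)

def cs_divergence(mu1, var1, mu2, var2):
    """Cauchy-Schwarz divergence: -log( <p,q> / sqrt(<p,p> <q,q>) )."""
    cross = log_gauss_overlap(mu1, var1, mu2, var2)
    self1 = log_gauss_overlap(mu1, var1, mu1, var1)
    self2 = log_gauss_overlap(mu2, var2, mu2, var2)
    return -cross + 0.5 * (self1 + self2)

# Two hypothetical 1-D states with unit variance and means one unit apart.
mu_a, var_a = np.array([0.0]), np.array([1.0])
mu_b, var_b = np.array([1.0]), np.array([1.0])
d_ab = cs_divergence(mu_a, var_a, mu_b, var_b)
```

The divergence is symmetric and is zero when the two distributions coincide, so a pairwise matrix of such values could serve as the dissimilarity input to a standard clustering routine. How the paper actually forms its clusters is not stated in the abstract.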

Full Text