Abstract

Speaker verification performance depends critically on generalizing across mismatches between training and test conditions and on resisting interference from other speakers. In this paper, we propose two novel approaches: one improves generalization to mismatched recording scenarios and languages at test time, and the other reduces the influence of interfering speakers on the similarity measurement between two speaker embeddings. First, parent embedding learning (PEL) is used for model training; it exploits the generalization ability of the shared structure to improve the representation of speaker embeddings. Second, partial adaptive score normalization (PAS-Norm) is used to reduce the influence of interference from other speakers on embedding-based similarity measures. The speaker embedding models are trained on the VoxCeleb2 dataset and evaluated under different conditions on four other datasets: VoxCeleb1, Librispeech, SITW, and CN-Celeb. On VoxCeleb1, evaluations with a large number of verification speakers and with identity restrictions show that the proposed PEL-based system reduces the equal error rate (EER) by 6.0% and 4.9%, respectively, compared to the state-of-the-art (SOTA) system. In the mismatched-condition experiments on SITW and CN-Celeb, the proposed PEL-based system also outperforms the SOTA system; under language mismatch, the EER is reduced by 8.3%. When evaluating the influence of interference from other speakers, replacing the baseline AS-Norm score normalization with PAS-Norm reduces the EER by 24.4%.
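For readers unfamiliar with the baseline, the following is a minimal sketch of AS-Norm (adaptive symmetric score normalization) as it is commonly described: the raw cosine score of a trial is standardized against the top-k cohort scores of the enrollment embedding and of the test embedding, and the two results are averaged. The abstract does not specify how PAS-Norm modifies this procedure, so only the baseline is sketched. The function names, the toy cohort, and the value of `k` are illustrative assumptions, not taken from the paper.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k_stats(embedding, cohort, k):
    """Mean and standard deviation of the k highest cohort scores."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)[:k]
    # Floor the deviation to avoid division by zero (an assumption of this sketch).
    return mean(scores), max(pstdev(scores), 1e-8)


def as_norm(enroll, test, cohort, k=2):
    """Baseline AS-Norm: average of the two top-k z-normalized scores."""
    raw = cosine(enroll, test)
    mu_e, sd_e = top_k_stats(enroll, cohort, k)
    mu_t, sd_t = top_k_stats(test, cohort, k)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
cohort = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
enroll = [1.0, 0.2, 0.1]
test_same = [0.9, 0.3, 0.1]   # close to the enrollment vector
test_diff = [0.1, 0.2, 1.0]   # far from the enrollment vector

print(as_norm(enroll, test_same, cohort))
print(as_norm(enroll, test_diff, cohort))
```

In practice the cohort would be a large set of embeddings from speakers outside the trial list, and `k` is typically in the hundreds; the small values here only keep the example readable.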
