Abstract

In this article, we conduct a comprehensive simulation study of the optimal scores of speaker recognition systems that are based on speaker embedding. For that purpose, we first revisit the optimal scores for the speaker identification (SI) task and the speaker verification (SV) task in the sense of minimum Bayes risk (MBR), and show that the optimal scores for the two tasks can be formulated as a single form of normalized likelihood (NL). We show that when the underlying model is linear Gaussian, the NL score is mathematically equivalent to the PLDA likelihood ratio (LR), and that the empirical scores based on cosine distance and Euclidean distance can be seen as approximations of this linear Gaussian NL score under certain conditions. Based on the unified NL score, we conduct a comprehensive simulation study to investigate the behavior of the scoring component on both the SI task and the SV task, in the case where the distribution of the speaker vectors perfectly matches the assumption of the NL model, as well as in cases where some mismatch is involved. Importantly, our simulation is based on the statistics of speaker vectors derived from a practical speaker recognition system, and hence reflects the behavior of NL scoring in real-life scenarios that are full of imperfections, including non-Gaussianity, non-homogeneity, and domain/condition mismatch.
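To make the scoring form concrete, the following is a minimal sketch, not code from the paper, of a normalized-likelihood score of the form NL(x; k) = p(x | speaker k) / p(x) under a deliberately simplified linear Gaussian model with isotropic between-speaker and within-speaker covariances (variances b and w). The function names nl_score and cosine_score and the isotropic assumption are illustrative only; the paper's NL model may use full covariances, as in PLDA.

import numpy as np
from scipy.stats import multivariate_normal

def nl_score(x, enroll, b=1.0, w=1.0):
    """Log normalized likelihood of test vector x given enrollment vectors.

    Illustrative linear Gaussian model (an assumption, not the paper's exact model):
        mu ~ N(0, b * I)   # speaker mean, between-speaker variance b
        x  ~ N(mu, w * I)  # observed vector, within-speaker variance w
    """
    enroll = np.atleast_2d(enroll)
    n, d = enroll.shape
    m = enroll.mean(axis=0)
    # Posterior of the speaker mean given the enrollment vectors (conjugate Gaussian).
    post_var = 1.0 / (1.0 / b + n / w)
    post_mean = post_var * (n / w) * m
    # Numerator: predictive density of x given the enrolled speaker.
    num = multivariate_normal.logpdf(x, mean=post_mean, cov=(post_var + w) * np.eye(d))
    # Denominator: marginal density of x under the model (any speaker).
    den = multivariate_normal.logpdf(x, mean=np.zeros(d), cov=(b + w) * np.eye(d))
    return num - den

def cosine_score(x, enroll):
    """Cosine similarity between the test vector and the mean enrollment vector."""
    m = np.atleast_2d(enroll).mean(axis=0)
    return float(np.dot(x, m) / (np.linalg.norm(x) * np.linalg.norm(m) + 1e-12))

Under this isotropic model, if the test vectors are length-normalized and the enrolled speaker is fixed, the log NL score varies monotonically with the inner product between the test vector and the posterior speaker mean, which illustrates why cosine scoring can act as an approximation of the linear Gaussian NL score under some conditions.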

Highlights

  • After decades of investigation, speaker recognition has achieved remarkable performance and has been deployed in a wide range of practical applications [1,2,3]

  • We present an analysis of the optimal score for speaker recognition based on the MAP principle and the linear Gaussian assumption

  • The analysis shows that the normalized likelihood (NL) is optimal for both identification and verification tasks in the sense of minimum Bayes risk


Introduction

Speaker recognition has achieved remarkable performance and has been deployed in a wide range of practical applications [1,2,3]. Speaker recognition research concerns two tasks: speaker identification (SI), which identifies the true speaker from a set of candidates, and speaker verification (SV), which tests whether an alleged speaker is the true speaker. The performance of SI systems is evaluated by the identification rate (IDR), i.e., the percentage of trials in which the speaker is correctly identified. SV systems require a threshold to decide whether to accept or reject the claimed speaker, and their performance is evaluated by the equal error rate (EER), which represents the trade-off between failures to accept true speakers and failures to reject impostors. Modern speaker recognition methods are based on the concept of speaker embedding, i.e., representing each speaker by a fixed-dimensional vector. A key component of the speaker embedding approach is how to score a trial.
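To make the two evaluation metrics concrete, here is a minimal sketch, with illustrative function names and not taken from the paper, of how IDR and EER could be computed from trial scores: IDR counts the SI trials whose top-scoring candidate is the true speaker, while EER is estimated as the operating point where the false rejection rate on target trials equals the false acceptance rate on non-target trials.

import numpy as np

def identification_rate(scores, true_ids):
    """IDR: fraction of SI trials whose top-scoring candidate is the true speaker.

    scores: (num_trials, num_candidates) score matrix.
    true_ids: (num_trials,) indices of the true speakers.
    """
    predicted = np.argmax(scores, axis=1)
    return float(np.mean(predicted == np.asarray(true_ids)))

def equal_error_rate(target_scores, nontarget_scores):
    """EER: error rate at the threshold where the false rejection rate (FRR)
    on target trials equals the false acceptance rate (FAR) on non-target trials.
    """
    target_scores = np.sort(np.asarray(target_scores))
    nontarget_scores = np.asarray(nontarget_scores)
    eer, best_gap = 1.0, np.inf
    # Sweep candidate thresholds over the target scores and find the FRR/FAR crossing.
    for thr in target_scores:
        frr = np.mean(target_scores < thr)       # targets wrongly rejected
        far = np.mean(nontarget_scores >= thr)   # non-targets wrongly accepted
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2.0
    return float(eer)

In practice, EER is usually read off a full DET/ROC sweep over all scores; the simple threshold sweep above is only meant to make the definition concrete.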
