Modelling and compensation for language mismatch in speaker verification

Abhinav Misra, John H.L. Hansen

doi:10.1016/j.specom.2017.09.004

Open Access

https://doi.org/10.1016/j.specom.2017.09.004

Journal: Speech Communication
Publication Date: Nov 8, 2017
Citations: 10
License type: publisher-specific-oa

Affiliation: The University of Texas at Dallas

Abstract

Language mismatch represents one of the more difficult challenges in achieving effective speaker verification in naturalistic audio streams. The proportion of bi-lingual speakers worldwide continues to grow, making speaker verification more difficult for speech technology. In this study, three specific methods are proposed to address this issue. Experiments are conducted on the PRISM (Promoting Robustness in Speaker Modeling) evaluation set. We first show that adding small amounts of multi-lingual seed data to the Probabilistic Linear Discriminant Analysis (PLDA) development set leads to a significant relative improvement of +17.96% in system Equal Error Rate (EER). Second, we compute the eigendirections that represent the distribution of the multi-lingual data added to PLDA. We show that by adding these new eigendirections as part of the Linear Discriminant Analysis (LDA), and then minimizing them to directly compensate for language mismatch, further performance gains for speaker verification are achieved. By combining both multi-lingual PLDA and this minimization step with the new set of eigendirections, we obtain a +26.03% relative improvement in EER. In practical scenarios, it is highly unlikely that multi-lingual seed data representing the languages present in the test set would be available. Hence, in the third phase, we address such scenarios by proposing a method for Locally Weighted Linear Discriminant Analysis (LWLDA). In this third method, we reformulate the LDA equations to incorporate a local affine transform that weights samples from the same speaker. This method effectively preserves the local intrinsic information represented by the multimodal structure of the within-speaker scatter matrix, thereby helping to improve the class-discriminating ability of LDA. It also extends the ability of LDA to transform the speaker i-Vectors to dimensions that are greater than the total number of speaker classes.
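To make the idea of weighting same-speaker samples concrete, the following is a minimal sketch of an affinity-weighted within-speaker scatter matrix. The abstract does not give the LWLDA equations, so this should not be read as the authors' formulation: it borrows the pairwise form used in local Fisher discriminant analysis, and the function name `local_within_scatter`, the Gaussian affinity, and the `sigma` bandwidth are all assumptions made for illustration.

```python
import numpy as np


def local_within_scatter(X, labels, sigma=None):
    """Affinity-weighted within-class scatter (illustrative sketch only).

    X      : (n_samples, dim) array, e.g. one i-Vector per row.
    labels : (n_samples,) speaker label per row.
    sigma  : hypothetical Gaussian bandwidth; None means every
             same-speaker pair gets weight 1.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dim = X.shape[1]
    S_w = np.zeros((dim, dim))

    for c in np.unique(labels):
        Xc = X[labels == c]
        n_c = Xc.shape[0]
        # Pairwise differences between samples of the same speaker.
        diffs = Xc[:, None, :] - Xc[None, :, :]  # shape (n_c, n_c, dim)
        sq_dist = np.sum(diffs ** 2, axis=-1)    # shape (n_c, n_c)

        if sigma is None:
            affinity = np.ones((n_c, n_c))
        else:
            # Assumed affinity: nearby pairs count more than distant ones.
            affinity = np.exp(-sq_dist / sigma ** 2)

        weights = affinity / n_c
        # 0.5 * sum_ij w_ij (x_i - x_j)(x_i - x_j)^T
        S_w += 0.5 * np.einsum("ij,ijk,ijl->kl", weights, diffs, diffs)

    return S_w
```

With `sigma=None` every same-speaker pair has weight 1 and the result equals the ordinary LDA within-class scatter. A finite `sigma` down-weights distant pairs, so well-separated clusters belonging to one speaker contribute less to the scatter, which is one way the multimodal within-speaker structure could be preserved.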
Using LWLDA, a relative improvement of +8.54% is obtained in system EER. LWLDA provides even more gain when multi-lingual seed data is available, and improves the system performance by a relative +26.03% in terms of EER. We also compare LWLDA to the recently proposed Nearest Neighbor Non-Parametric Discriminant Analysis (NDA). We show that LWLDA not only outperforms NDA in terms of system performance but is also computationally less expensive. Comparative studies on the DARPA Robust Automatic Transcription of Speech (RATS) corpus also show that LWLDA consistently outperforms NDA and LDA across different evaluation conditions. Our solutions offer new directions for addressing a challenging problem that has received limited attention in the speaker recognition community.
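The relative improvements quoted in the abstract are changes in EER measured against a baseline system. As a rough illustration, the sketch below estimates EER from trial scores with a simple threshold sweep and computes a relative improvement. The function names are hypothetical, and the threshold sweep is a simplification rather than the scoring tool used in the paper.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # Miss rate: fraction of target trials scoring below the threshold.
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials at or above it.
    false_alarm = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    # EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[idx] + false_alarm[idx])


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent; positive means the new system is better."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

For instance, `relative_improvement(10.0, 8.0)` gives 20.0, meaning the new system's EER is 20% lower than the baseline's in relative terms, not 20 percentage points lower.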

Full Text