Discriminative In-Set/Out-of-Set Speaker Recognition

Pongtep Angkititrakul,John H L Hansen

doi:10.1109/tasl.2006.881689

Abstract

In this paper, the problem of identifying in-set versus out-of-set speakers for limited training/test data durations is addressed. The recognition objective is to form a decision regarding an input speaker as being a legitimate member of a set of enrolled speakers or outside speakers. The general goal is to perform rapid speaker model construction from limited enrollment and test size resources for in-set testing for input audio streams. In-set detection can help ensure security and proper access to private information, as well as detecting and tracking input speakers. Areas of applications of these concepts include rapid speaker tagging and tracking for information retrieval, communication networks, personal device assistants, and location access. We propose an integrated system with emphasis on short-enrollment data (about 5 s of speech for each enrolled speaker) and test data (2-8 s) within a text-independent mode. We present a simple and yet powerful decision rule to accept or reject speakers using a discriminative vector in the decision score space, together with statistical hypothesis testing based on the conventional likelihood ratio test. Discriminative training is introduced to further improve system performance for both decision techniques, by employing minimum classification error and minimum verification error frameworks. Experiments are performed using three separate corpora. Using the YOHO speaker recognition database, the alternative decision rule achieves measurable improvement over the likelihood ratio test, and discriminative training consistently enhances overall system performance with relative improvements ranging from 11.26%-28.68%. A further extended evaluation using the TIMIT (CORPUS1) and actual noisy aircraft communications data (CORPUS2) shows measurable improvement over the traditional MAP based scheme using the likelihood ratio test (MAP-LRT), with average EERs of 9%-23% for TIMIT and 13%-32% for noisy aircraft communications. The results confirm that an effective in-set/out-of-set speaker recognition system can be formulated using discriminative training for rapid tagging of input speakers from limited training and test data sizes

Full Text