In this paper, we explore joint factor analysis (JFA) for text-dependent speaker recognition with random digit strings. The core of the proposed method is a JFA model that we use to extract features. These features can represent either entire utterances or individual digits, and are fed into a trainable backend to estimate likelihood ratios. Within this framework, we propose several extensions. The first is a logistic regression method for combining log-likelihood ratios that correspond to individual mixture components. The second is the extraction of phonetically aware Baum-Welch statistics, using forced alignment instead of the typical posterior probabilities derived from the universal background model. We also explore a digit-string-dependent form of score normalization, which yields a notable improvement over the standard approach. By fusing six JFA features, we attained equal error rates of 2.01% and 3.19% for male and female speakers, respectively, on the challenging RSR2015 (part III) dataset.