Abstract

In this paper, we explore joint factor analysis (JFA) for text-dependent speaker recognition with random digit strings. The core of the proposed method is a JFA model from which we extract features. These features can represent either whole utterances or individual digits, and are fed into a trainable backend to estimate likelihood ratios. Within this framework, we propose several extensions. The first is a logistic regression method for combining log-likelihood ratios that correspond to individual mixture components. The second is the extraction of phonetically aware Baum-Welch statistics, using forced alignment instead of the usual posterior probabilities derived from the universal background model. We also explore a digit-string-dependent form of score normalization that yields a notable improvement over the standard one. By fusing six JFA features, we attained equal error rates of 2.01% and 3.19% on male and female speakers, respectively, on the challenging RSR2015 (part III) dataset.
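
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an equal error rate (EER) can be estimated from trial scores. It is not taken from the paper: the function name `equal_error_rate`, the threshold sweep, and the toy scores are illustrative assumptions. The EER is the operating point at which the false acceptance rate and the false rejection rate coincide.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Hypothetical helper for illustration only; a trial is accepted when its
    score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one target trial scores below one non-target trial.
targets = [2.0, 1.5, 0.5, 3.0]
nontargets = [-1.0, 0.0, 1.0, -2.0]
print(equal_error_rate(targets, nontargets))
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch reports their average at the closest point; evaluation toolkits typically interpolate instead.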
