Abstract
Most speaker recognition systems rely on short-term acoustic cepstral features to extract speaker-relevant information from the signal. However, phonetic discriminant features, extracted by a bottleneck multi-layer perceptron (MLP) over longer stretches of time, can provide complementary information and have been adopted in speech transcription systems. We compare speaker verification performance using cepstral features, discriminant features, and a concatenation of both followed by a dimension reduction. We consider two speaker recognition systems, one based on maximum likelihood linear regression (MLLR) super-vectors and the other on a state-of-the-art i-vector system with two session variability compensation schemes. Experiments are reported on standard configurations of the NIST SRE 2008 and 2010 databases. The results show that the phonetically discriminative MLP features retain speaker-specific information that is complementary to the short-term cepstral features. The performance improvement is obtained with both score-domain and feature-domain fusion, and the speaker verification equal error rate (EER) is reduced by up to 50% relative compared to the best i-vector system using only cepstral features.
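To make the two fusion strategies mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it assumes hypothetical 39-dimensional cepstral and MLP bottleneck feature streams, uses PCA as a stand-in for the paper's dimension-reduction step, and fuses per-trial scores with a simple weighted sum.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical per-frame features for one utterance:
# 39-dim cepstral vectors and 39-dim MLP bottleneck outputs (illustrative only).
rng = np.random.default_rng(0)
n_frames = 500
cepstral_feats = rng.normal(size=(n_frames, 39))
bottleneck_feats = rng.normal(size=(n_frames, 39))

# Feature-domain fusion: concatenate the two streams frame by frame,
# then reduce the 78-dim vectors to a compact dimensionality before
# passing them to an MLLR super-vector or i-vector back-end.
concatenated = np.hstack([cepstral_feats, bottleneck_feats])
reducer = PCA(n_components=40)  # target dimension chosen for illustration
fused_feats = reducer.fit_transform(concatenated)

# Score-domain fusion: a weighted sum of the per-trial scores produced
# by the cepstral-only and MLP-only systems (weight w is a free parameter).
def fuse_scores(score_cep, score_mlp, w=0.5):
    return w * score_cep + (1.0 - w) * score_mlp

print(fused_feats.shape)            # (500, 40)
print(fuse_scores(1.2, 0.4, w=0.7)) # fused trial score
```

The actual systems described in the paper may use a different projection (e.g. LDA/HLDA) and calibrated fusion weights; the sketch only illustrates where feature-domain and score-domain fusion sit in the pipeline.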