Abstract

We present a thorough analysis of the systems developed by the JHU-MIT consortium for the NIST Speaker Recognition Evaluation 2018 (SRE18). At the previous NIST evaluation, in 2016, i-vectors were the state of the art in speaker recognition. Neural network embeddings (a.k.a. x-vectors) have since emerged as the best-performing approach. We show that, in some conditions, x-vectors reduce the detection error by a factor of 2 with respect to i-vectors. We ran experiments on the Speakers In The Wild (SITW) evaluation, NIST SRE18 VAST (Video Annotation for Speech Technology), and SRE18 CMN2 (Call My Net 2, telephone Tunisian Arabic) to compare network architectures, pooling layers, training objectives, back-end adaptation methods, and calibration techniques. x-Vectors based on factorized and extended TDNN networks achieved unparalleled performance on SITW and CMN2 data. For VAST, however, performance was significantly worse than for SITW. We noted that the VAST audio quality was severely degraded compared to SITW, even though both consist of Internet videos. This degradation caused a strong domain mismatch between the training data and VAST. Because of this mismatch, large networks performed only slightly better than smaller ones. The mismatch also complicated VAST calibration. Nevertheless, we managed to calibrate VAST by adapting the SITW score distribution to VAST using a small amount of in-domain development data. Regarding pooling methods, the learnable dictionary encoder performed best, which suggests that the representations learned by x-vector encoders are multi-modal. Maximum-margin losses were better than cross-entropy for in-domain data but not for the mismatched VAST data. We also analyzed back-end adaptation methods on CMN2. Semi-supervised PLDA adaptation and adaptive score normalization (AS-Norm) yielded significant improvements. However, results were still worse than in English in-domain conditions such as SITW. We conclude that x-vectors have become the new state of the art in speaker recognition. However, their advantage shrinks under strong domain mismatch. Domain adaptation and domain-invariant training approaches need to be investigated to improve performance in all conditions. Speech enhancement techniques focused on improving speaker recognition performance could also be of great help.
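
As an illustration of the adaptive score normalization (AS-Norm) mentioned above, the following is a minimal sketch of one common formulation, not necessarily the exact variant used in the evaluated systems. It assumes that each side of a trial (enrollment and test) has already been scored against a cohort, and that only the `top_n` highest cohort scores on each side are used to estimate the normalization statistics. The function name `as_norm`, its parameters, and the default `top_n` are hypothetical choices made for this example.

```python
import numpy as np


def as_norm(score, enroll_cohort, test_cohort, top_n=2, eps=1e-8):
    """Adaptive symmetric score normalization for a single trial (sketch).

    score         -- raw trial score between enrollment and test
    enroll_cohort -- scores of the enrollment side against the cohort
    test_cohort   -- scores of the test side against the cohort
    top_n         -- number of highest cohort scores kept per side (assumed value)
    """

    def top_stats(cohort_scores):
        # Keep only the top_n highest cohort scores (the "adaptive" part).
        top = np.sort(np.asarray(cohort_scores, dtype=float))[-top_n:]
        # Floor the standard deviation to avoid division by zero.
        return top.mean(), max(top.std(), eps)

    mu_e, sd_e = top_stats(enroll_cohort)
    mu_t, sd_t = top_stats(test_cohort)
    # Average the enrollment-side and test-side z-normalized scores.
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)


if __name__ == "__main__":
    print(as_norm(5.0, [0.0, 1.0, 2.0, 4.0], [1.0, 3.0, 5.0, 9.0], top_n=2))
```

In practice the cohort would be drawn from data close to the target domain, and `top_n` would be tuned on development data; both choices are outside the scope of this sketch.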
