State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations

Jesús Villalba,Nanxin Chen,David Snyder,Daniel Garcia-Romero,Alan Mccree,Gregory Sell,Jonas Borgstrom,Leibny Paola García-Perera,Fred Richardson,Réda Dehak,Pedro A Torres-Carrasquillo,Najim Dehak

doi:10.1016/j.csl.2019.101026

Abstract

We present a thorough analysis of the systems developed by the JHU-MIT consortium in the context of NIST speaker recognition evaluation 2018. In the previous NIST evaluation, in 2016, i-vectors were the speaker recognition state-of-the-art. However now, neural network embeddings (a.k.a. x-vectors) rise as the best performing approach. We show that in some conditions, x-vectors’ detection error reduces by 2 w.r.t. i-vectors. In this work, we experimented on the Speakers In The Wild evaluation (SITW), NIST SRE18 VAST (Video Annotation for Speech Technology), and SRE18 CMN2 (Call My Net 2, telephone Tunisian Arabic) to compare network architectures, pooling layers, training objectives, back-end adaptation methods, and calibration techniques. x-Vectors based on factorized and extended TDNN networks achieved performance without parallel on SITW and CMN2 data. However for VAST, performance was significantly worse than for SITW. We noted that the VAST audio quality was severely degraded compared to the SITW, even though they both consist of Internet videos. This degradation caused strong domain mismatch between training and VAST data. Due to this mismatch, large networks performed just slightly better than smaller networks. This also complicated VAST calibration. However, we managed to calibrate VAST by adapting SITW scores distribution to VAST, using a small amount of in-domain development data.Regarding pooling methods, learnable dictionary encoder performed the best. This suggests that representations learned by x-vector encoders are multi-modal. Maximum margin losses were better than cross-entropy for in-domain data but not in VAST mismatched data. We also analyzed back-end adaptation methods in CMN2. PLDA semi-supervised adaptation and adaptive score normalization (AS-Norm) yielded significant improvements. However, results were still worse than in English in-domain conditions like SITW.We conclude that x-vectors have become the new state-of-the-art in speaker recognition. However, their advantages reduce in cases of strong domain mismatch. We need to investigate domain adaptation and domain invariant training approaches to improve performance in all conditions. Also, speech enhancement techniques with a focus on improving the speaker recognition performance could be of great help.

Full Text