Deep speaker embeddings for Speaker Verification: Review and experimental comparison

Maros Jakubec, Roman Jarina, Eva Lieskovska, Peter Kasak

doi:10.1016/j.engappai.2023.107232

Abstract

The construction of speaker-specific acoustic models for automatic speaker recognition now relies almost exclusively on deep neural network-based speaker embeddings. This work reviews recent progress in speaker embedding development and presents an experimental benchmark comparison of state-of-the-art deep speaker representations for the Speaker Verification (SV) task. The performance evaluation of the existing and proposed models on the VoxCeleb1 benchmark database shows that SV systems based on r-vectors with a Res2Net convolutional architecture, multi-head attention pooling and additive margin softmax outperform other solutions such as d-vectors, x-vectors and conventional r-vectors. In addition, an ensemble network is proposed that fuses the best-performing speaker embeddings. Different types of embeddings were found to contain complementary speaker-related information: we show that a concatenation of x-vectors and r-vectors can further improve the performance of the SV system. The best-performing embedding ensemble achieves an Equal Error Rate of 2.52% on the VoxCeleb1 benchmark test, which is lower than other published results obtained on the same dataset using the standard VoxCeleb1 evaluation methodology.
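To make the two quantities named in the abstract concrete, the following is a minimal sketch of embedding concatenation and of an Equal Error Rate computation. It is an illustration only, not the authors' implementation: the function names (`fuse`, `cosine_score`, `equal_error_rate`), the per-embedding length normalization, the cosine scoring back end and the threshold sweep over observed scores are all assumptions made here, and the ensemble network proposed in the paper may combine embeddings differently.

```python
import numpy as np

def l2_normalize(v):
    # Scale a vector (or each row of a matrix) to unit Euclidean length.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def fuse(x_vec, r_vec):
    # Hypothetical fusion: length-normalize each embedding, then concatenate.
    # Normalizing first keeps one embedding type from dominating the other.
    return np.concatenate([l2_normalize(x_vec), l2_normalize(r_vec)], axis=-1)

def cosine_score(a, b):
    # Assumed scoring back end: cosine similarity between two embeddings.
    return float(np.dot(l2_normalize(a), l2_normalize(b)))

def equal_error_rate(scores, labels):
    # scores: similarity per trial; labels: True for same-speaker trials.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    # Sweep every observed score as a decision threshold and keep the
    # point where false acceptance and false rejection rates are closest.
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # impostor trials accepted
        frr = np.mean(~accept[labels])   # target trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

In use, each enrollment and test utterance would be mapped to a fused vector with `fuse`, each trial scored with `cosine_score`, and the resulting scores and same-speaker labels passed to `equal_error_rate`. An EER of 2.52% corresponds to the operating point where about 2.52% of impostor trials are accepted and about 2.52% of target trials are rejected.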

Full Text