SpeechNAS: Towards Better Trade-Off Between Latency and Accuracy for Large-Scale Speaker Verification

Wentao Zhu, Shun Lu, Sen Yang, Tianlong Kong, Xiaorui Wang, Dawei Zhang, Jixiang Li, Ji Liu, Feng Deng

doi:10.1109/asru51503.2021.9688017

Abstract

The x-vector [1] has recently become a successful and popular approach to speaker verification. It employs a time delay neural network (TDNN) and statistics pooling to extract speaker-characterizing embeddings from variable-length utterances. Improving upon the x-vector is an active research area, and numerous neural networks have been elaborately designed on top of it, e.g., extended TDNN (E-TDNN) [2], factorized TDNN (F-TDNN) [3], and densely connected TDNN (D-TDNN) [4]. In this work, we use neural architecture search (NAS) to identify optimal architectures from a TDNN-based search space, an approach we name SpeechNAS. Leveraging recent advances in speaker recognition, such as high-order statistics pooling, the multi-branch mechanism, D-TDNN, and angular additive margin softmax (AAM) loss with minimum hyper-spherical energy (MHE), SpeechNAS automatically discovers five network architectures, SpeechNAS-1 to SpeechNAS-5, with various numbers of parameters and GFLOPs on the large-scale text-independent speaker recognition dataset VoxCeleb1. Our best derived network achieves an equal error rate (EER) of 1.02% on the standard test set of VoxCeleb1, surpassing previous TDNN-based state-of-the-art approaches by a large margin.
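To illustrate how statistics pooling turns variable-length utterances into fixed-length vectors, the following is a minimal sketch in plain Python. It assumes the basic mean-plus-standard-deviation form with a population (divide-by-N) standard deviation; it is not the high-order variant used in SpeechNAS, and the function name `statistics_pooling` and the toy inputs are hypothetical.

```python
import math


def statistics_pooling(frames):
    """Pool frame-level features into one fixed-length vector.

    `frames` is a list of equal-length feature vectors, one per time frame.
    Returns the per-dimension means followed by the per-dimension
    population standard deviations (an assumption of this sketch).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each feature dimension across all time frames.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Population standard deviation of each dimension across time.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Two toy "utterances" with different numbers of frames (2 vs. 4),
# each frame holding a 2-dimensional feature vector.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]

short_vec = statistics_pooling(short_utt)
long_vec = statistics_pooling(long_utt)
```

Both pooled vectors have length 4 (twice the feature dimension) regardless of how many frames the utterance contains, which is what lets a fixed-size embedding layer follow the pooling step.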

Full Text