Abstract

Recently, convolutional neural networks (CNNs) have been widely used in speaker verification tasks and have achieved state-of-the-art performance on most dominant datasets, such as NIST SRE, VoxCeleb, and CNCeleb. However, when speaker classification is performed with one-hot encoding, the weight of the last fully-connected layer has shape \( B \times N \), where B is the mini-batch size and N is the number of speakers, so this layer requires increasingly large GPU memory as the number of speakers grows. To address this problem, we introduce the virtual fully-connected (Virtual FC) layer from the field of face recognition into large-scale speaker verification via a re-grouping strategy that maps N to M (M is a hyperparameter smaller than N), so that the number of weight parameters in this layer is reduced to M/N of the original. We also explore the effect of the number of utterances per speaker in each mini-batch on performance.

Keywords: Speaker verification, Virtual fully-connected layer, Large-scale
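The parameter saving described above can be illustrated with a minimal sketch. This is not the authors' implementation: the random group assignment, layer names, and sizes below are illustrative assumptions, written with NumPy to show only the re-grouping idea (each of the N speakers is mapped to one of M virtual classes, so the classification weight has M columns instead of N).

```python
import numpy as np

def make_groups(num_speakers, num_groups, seed=0):
    # Re-grouping strategy (illustrative): assign each of the N speakers
    # to one of M virtual groups; the real paper's grouping may differ.
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_groups, size=num_speakers)

class VirtualFC:
    # Sketch of a Virtual FC layer: the classification weight has
    # shape (feat_dim, M) rather than (feat_dim, N), shrinking the
    # parameter count of this layer by a factor of M/N.
    def __init__(self, feat_dim, num_groups, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = 0.01 * rng.standard_normal((feat_dim, num_groups))

    def forward(self, embeddings):
        # embeddings: (batch, feat_dim) -> logits over M virtual classes
        return embeddings @ self.weight

N, M, d = 10_000, 100, 256            # hypothetical sizes
groups = make_groups(N, M)            # speaker id -> virtual class id
layer = VirtualFC(d, M)

# Training targets become group labels instead of speaker labels.
speaker_labels = np.array([0, 5, 9_999])
virtual_labels = groups[speaker_labels]

full_params = d * N                   # conventional FC layer
virtual_params = layer.weight.size    # Virtual FC layer
print(virtual_params / full_params)   # -> 0.01, i.e. M / N
```

Under these assumed sizes, the classification layer drops from 2,560,000 to 25,600 weights, which is the M/N reduction the abstract refers to.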
