Abstract

Recently, convolutional neural networks (CNNs) have been widely used in speaker verification tasks and have achieved state-of-the-art performance on most dominant datasets, such as NIST SRE, VoxCeleb, and CNCeleb. However, when speaker classification is performed with one-hot encoding, the weight of the last fully-connected layer has shape \( B \times N \), where B is the mini-batch size and N is the number of speakers, so this layer requires increasingly large GPU memory as the number of speakers grows. To address this problem, we introduce the virtual fully-connected (Virtual FC) layer from the field of face recognition into large-scale speaker verification via a re-grouping strategy that maps N to M (M is a hyperparameter smaller than N), so that the number of weight parameters in this layer is reduced to M/N of the original. We also explore the effect of the number of utterances per speaker in each mini-batch on performance.

Keywords: Speaker verification, Virtual fully-connected layer, Large-scale
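The parameter saving described above can be illustrated with a minimal sketch. This is not the authors' implementation: the random group assignment, layer names, and sizes below are illustrative assumptions, written with NumPy to show only the re-grouping idea (each of the N speakers is mapped to one of M virtual classes, so the classification weight has M columns instead of N).

```python
import numpy as np

def make_groups(num_speakers, num_groups, seed=0):
    # Re-grouping strategy (illustrative): assign each of the N speakers
    # to one of M virtual groups; the real paper's grouping may differ.
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_groups, size=num_speakers)

class VirtualFC:
    # Sketch of a Virtual FC layer: the classification weight has
    # shape (feat_dim, M) rather than (feat_dim, N), shrinking the
    # parameter count of this layer by a factor of M/N.
    def __init__(self, feat_dim, num_groups, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = 0.01 * rng.standard_normal((feat_dim, num_groups))

    def forward(self, embeddings):
        # embeddings: (batch, feat_dim) -> logits over M virtual classes
        return embeddings @ self.weight

N, M, d = 10_000, 100, 256            # hypothetical sizes
groups = make_groups(N, M)            # speaker id -> virtual class id
layer = VirtualFC(d, M)

# Training targets become group labels instead of speaker labels.
speaker_labels = np.array([0, 5, 9_999])
virtual_labels = groups[speaker_labels]

full_params = d * N                   # conventional FC layer
virtual_params = layer.weight.size    # Virtual FC layer
print(virtual_params / full_params)   # -> 0.01, i.e. M / N
```

Under these assumed sizes, the classification layer drops from 2,560,000 to 25,600 weights, which is the M/N reduction the abstract refers to.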
