Abstract

Self-supervised contrastive learning (SSCL) has achieved great success in speaker verification (SV). All recent works treat within-utterance speaker embeddings (SE) as positive instances, encouraging them to be as close as possible. However, positive instances drawn from the same utterance share similar channel and semantic information, which is difficult to disentangle from the speaker features. Moreover, such positive instances provide only limited variation within a given speaker. To tackle these problems, we propose to use nearest neighbor (NN) positive instances for training, selected from a dynamic queue. The NN positive instances carry different channel and semantic information, increasing the intra-speaker variation seen during training. Our proposed method is validated through comprehensive experiments on the VoxCeleb and CNCeleb1 datasets, demonstrating its effectiveness in improving both SSCL and fine-tuning results. Additionally, our SSCL model outperforms the supervised model in cross-dataset testing thanks to the use of massive unlabeled data.
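To make the idea concrete, below is a minimal sketch of how nearest-neighbor positives can be mined from a dynamic queue and plugged into an InfoNCE-style contrastive loss. This is an illustrative reconstruction under assumed design choices (FIFO queue of normalized embeddings, in-batch negatives, `temperature = 0.1`), not the paper's actual implementation; all names and hyperparameters are hypothetical.

```python
# Illustrative sketch: NN-positive contrastive loss with a dynamic queue (PyTorch).
# Assumed details (queue size, temperature, normalization) are not from the paper.
import torch
import torch.nn.functional as F


class NNQueue:
    """Dynamic FIFO queue of past speaker embeddings used to mine NN positives."""

    def __init__(self, queue_size: int, dim: int):
        # Initialize with random unit vectors; they are overwritten as training proceeds.
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def nearest(self, z: torch.Tensor) -> torch.Tensor:
        # For each embedding, return its most similar entry in the queue (cosine similarity).
        sim = F.normalize(z, dim=1) @ self.queue.t()   # (B, Q)
        return self.queue[sim.argmax(dim=1)]           # (B, D)

    @torch.no_grad()
    def enqueue(self, z: torch.Tensor) -> None:
        # Replace the oldest entries with the newest normalized embeddings.
        z = F.normalize(z, dim=1)
        n = z.size(0)
        idx = (self.ptr + torch.arange(n)) % self.queue.size(0)
        self.queue[idx] = z
        self.ptr = (self.ptr + n) % self.queue.size(0)


def nn_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                        queue: NNQueue, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss where each sample's positive is the queue NN of its other view."""
    z1 = F.normalize(z1, dim=1)
    nn_pos = queue.nearest(z2)                       # (B, D) NN positive per sample
    logits = z1 @ nn_pos.t() / temperature           # (B, B): diagonal = positives
    labels = torch.arange(z1.size(0))                # in-batch negatives off-diagonal
    loss = F.cross_entropy(logits, labels)
    queue.enqueue(z2)                                # keep the queue "dynamic"
    return loss


# Usage with dummy speaker embeddings from two augmented views of the same batch.
queue = NNQueue(queue_size=4096, dim=256)
z1, z2 = torch.randn(32, 256), torch.randn(32, 256)
print(nn_contrastive_loss(z1, z2, queue))
```

The key difference from standard within-utterance SSCL is that the positive is no longer the other view of the same utterance but its nearest neighbor among previously seen embeddings, so the positive pair can differ in channel and content while (ideally) sharing speaker identity.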
