Abstract

Speaker recognition achieved great progress recently, however, it is not easy or efficient to further improve its performance via traditional solutions: collecting more data and designing new neural networks. Aiming at the fundamental challenge of speech data, i.e. low information density, multimodal learning can mitigate this challenge by introducing richer and more discriminative information as input for identity recognition. Specifically, since the face image is more discriminative than the speech for identity recognition, we conduct multimodal learning by introducing a face recognition model (teacher) to transfer discriminative knowledge to a speaker recognition model (student) during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between face and speech can easily lead to overfitting. In this work, we introduce a multimodal learning framework, VGSR (Vision-Guided Speaker Recognition). Specifically, we propose a MKD (Margin-based Knowledge Distillation) strategy for cross-modality distillation by introducing a loose constrain to align the teacher and student, greatly reducing overfitting. Our MKD strategy can easily adapt to various existing knowledge distillation methods. In addition, we propose a QAW (Quality-based Adaptive Weights) module to weight input samples via quantified data quality, leading to a robust model training. Experimental results on the VoxCeleb1 and CN-Celeb datasets show our proposed strategies can effectively improve the accuracy of speaker recognition by a margin of 10% ∼ 15%, and our methods are very robust to different noises.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.