Abstract

Supervised learning and self-supervised learning address different facets of the problem. Supervised learning achieves high accuracy but requires large amounts of expensive labeled data. Self-supervised learning, in contrast, exploits abundant unlabeled data, but its performance lags behind that of its supervised counterpart. To reduce the cost of acquiring annotated data while retaining high performance in speaker verification, we propose a self-supervised joint learning (SS-JL) framework that complements the supervised main task with self-supervised auxiliary tasks in joint training. These auxiliary tasks help the speaker verification pipeline generate robust speaker representations closely tied to voiceprints. Our model is trained on an English dataset and tested on multilingual datasets covering English, Chinese, and Korean, achieving improvements of 13.6%, 12.7%, and 13.5%, respectively, in equal error rate (EER) compared with the baselines.
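To make the joint-training idea concrete, the sketch below combines a supervised speaker-classification loss with a weighted self-supervised auxiliary loss over a shared encoder. This is a minimal illustration, not the authors' implementation: the module names, dimensions, the feature-reconstruction pretext task, and the weight `lambda_ssl` are all assumptions for the example.

```python
# Sketch of self-supervised joint learning (SS-JL) style training:
# a shared encoder feeds a supervised classification head and a
# self-supervised auxiliary head; losses are combined with a weight.
# All names and hyperparameters here are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SSJLModel(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=192, num_speakers=1000):
        super().__init__()
        # Shared encoder producing the speaker embedding (voiceprint).
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        # Supervised main task: speaker classification on labeled data.
        self.cls_head = nn.Linear(emb_dim, num_speakers)
        # Self-supervised auxiliary task: reconstruct the input features
        # (one of many possible pretext tasks; assumed for illustration).
        self.recon_head = nn.Linear(emb_dim, feat_dim)

    def forward(self, feats):
        emb = self.encoder(feats)
        return self.cls_head(emb), self.recon_head(emb)

def joint_loss(model, feats, labels, lambda_ssl=0.1):
    """Supervised loss plus weighted self-supervised auxiliary loss."""
    logits, recon = model(feats)
    loss_sup = F.cross_entropy(logits, labels)   # supervised main task
    loss_ssl = F.mse_loss(recon, feats)          # auxiliary pretext task
    return loss_sup + lambda_ssl * loss_ssl

# Usage: one training step on a toy batch.
model = SSJLModel()
feats = torch.randn(8, 80)                 # batch of utterance features
labels = torch.randint(0, 1000, (8,))      # speaker identity labels
loss = joint_loss(model, feats, labels)
loss.backward()
```

In this style of joint training, the auxiliary loss acts as a regularizer that pushes the shared encoder toward representations that remain informative about the input signal, rather than overfitting to the classification labels alone.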
