Abstract

By constraining the lexical content of input speech, text-dependent speaker verification (TD-SV) offers more reliable performance than text-independent speaker verification (TI-SV) on short utterances. Because speech with constrained lexical content is harder to collect, TD models are often fine-tuned from a TI model using a small target-phrase dataset. Sometimes, however, the target-phrase dataset is too small even for fine-tuning, and this is the main obstacle to deploying TD-SV. One solution is to fine-tune the model on medium-size multi-phrase TD data and then deploy it on the target phrase. Although this strategy helps in some cases, performance remains sub-optimal because the model is not optimized for the target phrase. Inspired by recent progress in meta-learning, we propose a three-stage pipeline for adapting a TI model to a TD model for the target phrase. First, a TI model is trained on a large amount of speech data. Then, a multi-phrase TD dataset is used to tune the TI model via model-agnostic meta-learning. Finally, fast adaptation is performed using a small target-phrase dataset. Results show that the three-stage pipeline consistently outperforms both multi-phrase fine-tuning and target-phrase fine-tuning.
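
To make the three stages concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. It assumes a one-parameter regression model `y = w * x` in place of a speaker embedding network, and treats each "phrase" as a regression task with its own slope. All names (`sgd`, `maml_step`, `sample`, `ALPHA`, `BETA`) and all numeric values are hypothetical choices made for illustration. Because the model is a scalar, the second-order term of model-agnostic meta-learning can be written out by hand.

```python
import random

def mse(w, data):
    """Mean squared error of the toy model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(w, data):
    """First derivative of the loss with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def mse_hess(data):
    """Second derivative of the loss (constant in w for this model)."""
    return sum(2.0 * x * x for x, _ in data) / len(data)

def sgd(w, data, lr, steps):
    """Plain gradient descent; used for pretraining and for fast adaptation."""
    for _ in range(steps):
        w -= lr * mse_grad(w, data)
    return w

def maml_step(w, tasks, alpha, beta):
    """One meta-update with a single inner step per task."""
    meta_grad = 0.0
    for support, query in tasks:
        w_fast = sgd(w, support, alpha, steps=1)
        # Chain rule through the inner step:
        # dL_query(w_fast)/dw = L_query'(w_fast) * (1 - alpha * L_support'')
        meta_grad += mse_grad(w_fast, query) * (1.0 - alpha * mse_hess(support))
    return w - beta * meta_grad / len(tasks)

def sample(slope, rng, n):
    """Draw n noise-free points from the task y = slope * x."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x) for x in xs]

rng = random.Random(0)
ALPHA, BETA = 0.1, 0.1  # inner and outer learning rates (assumed values)

# Stage 1: "text-independent" pretraining on a large generic dataset.
w_ti = sgd(0.0, sample(1.0, rng, 200), lr=0.1, steps=50)

# Stage 2: meta-learning over several "phrases" (tasks with different slopes).
tasks = [(sample(s, rng, 5), sample(s, rng, 5)) for s in (1.5, 2.0, 2.5)]
w_meta = w_ti
for _ in range(200):
    w_meta = maml_step(w_meta, tasks, ALPHA, BETA)

# Stage 3: fast adaptation to an unseen "target phrase" from a few samples.
target_support = sample(2.8, rng, 5)
target_query = sample(2.8, rng, 50)
w_target = sgd(w_meta, target_support, ALPHA, steps=1)

loss_before = mse(w_meta, target_query)
loss_after = mse(w_target, target_query)
```

In a real network the factor `(1 - alpha * L_support'')` becomes a Hessian-vector product that is normally obtained through automatic differentiation; dropping it gives the common first-order approximation of the same procedure.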
