Automatic speaker verification (ASV) exhibits unsatisfactory performance under domain mismatch caused by intrinsic and extrinsic factors, such as the variations in speaking styles and recording devices encountered in real-world applications. Domain generalization has been explored to ensure robust performance under unseen conditions. However, there is an inherent tension between model discrimination and domain generalization: discrimination ability may degrade as the model learns to generalize. In this paper, to extract discriminative yet domain-invariant representations, we propose meta-generalized speaker verification (MGSV) via meta-learning. Specifically, we introduce a metric-based distribution optimization and a gradient-based meta-optimization that jointly supervise the spatial relationships among embeddings and improve the model's generalization to unseen domains. In addition, building on the single-domain (SD) and single-single (SS) strategies, we design multiple-single (MS) and simulated speaker verification (SSV) sampling strategies to simulate the train/test domain mismatch more faithfully and thereby mine transferable speaker-related knowledge. SSV proves the most effective, as it substantially improves domain generalization while ensuring that the model learns to discriminate effectively. Finally, to directly reflect model performance on unseen domains, the proposed method is evaluated on cross-genre, cross-device, and cross-dataset tasks. The experimental results demonstrate that our proposed method achieves remarkable performance in handling domain-mismatch issues in speaker verification.
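Although the abstract gives no implementation details, the episodic, gradient-based meta-optimization it describes can be illustrated with a minimal first-order sketch. The snippet below is a hypothetical PyTorch illustration, not the paper's MGSV implementation: the encoder, the contrastive-style stand-in for the metric-based distribution optimization, and all hyperparameters are placeholder assumptions, and the actual method additionally depends on the SSV sampling strategy to construct its meta-train/meta-test episodes.

```python
# Minimal, first-order sketch of episodic meta-optimization for domain-generalized
# speaker verification. All class/function names and hyperparameters here are
# hypothetical placeholders, not the paper's actual MGSV implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """Toy embedding network standing in for a real speaker encoder."""

    def __init__(self, feat_dim=80, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x):
        # Length-normalized embeddings so cosine similarity is a dot product.
        return F.normalize(self.net(x), dim=-1)


def metric_loss(emb, labels, margin=0.5):
    """Simple metric-based objective: pull same-speaker embeddings together and
    push different-speaker embeddings apart (a stand-in for the paper's
    metric-based distribution optimization)."""
    sim = emb @ emb.t()                                   # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos = sim[same & ~eye]                                # same speaker, different utterance
    neg = sim[~same]                                      # different speakers
    return F.relu(margin - pos).mean() + F.relu(neg).mean()


def meta_step(model, optimizer, meta_train_batch, meta_test_batch, inner_lr=0.01):
    """One episode: adapt a clone on a 'seen' domain, evaluate the adapted weights
    on a held-out domain, and update the shared model with the clone's gradients
    (a first-order approximation of gradient-based meta-optimization)."""
    x_tr, y_tr = meta_train_batch   # utterances/labels from meta-train domains
    x_te, y_te = meta_test_batch    # utterances/labels from a held-out domain

    # Inner loop: one gradient step on the meta-train domain using a throwaway clone.
    fast = copy.deepcopy(model)
    inner_loss = metric_loss(fast(x_tr), y_tr)
    grads = torch.autograd.grad(inner_loss, fast.parameters())
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p.sub_(inner_lr * g)

    # Outer loop: the adapted clone must still discriminate on the unseen domain.
    outer_loss = metric_loss(fast(x_te), y_te)
    outer_loss.backward()           # gradients accumulate on the clone

    # First-order meta-update: copy the clone's gradients onto the shared model.
    optimizer.zero_grad()
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = fp.grad.clone() if fp.grad is not None else None
    optimizer.step()
    return outer_loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = SpeakerEncoder()
    optim = torch.optim.SGD(model.parameters(), lr=0.1)
    # Synthetic episode: random features and speaker labels from two "domains".
    meta_train = (torch.randn(16, 80), torch.randint(0, 4, (16,)))
    meta_test = (torch.randn(16, 80), torch.randint(0, 4, (16,)))
    print("outer loss:", meta_step(model, optim, meta_train, meta_test))
```

In such a sketch, `meta_train_batch` and `meta_test_batch` would be drawn from disjoint domains (e.g., different genres or recording devices) so that the outer loss mimics evaluation on an unseen domain, which is the role the SD/SS/MS/SSV sampling strategies play in the paper.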