Abstract

Most recent speaker verification tasks are studied under open-set evaluation scenarios that reflect real-world conditions. This implies that generalization to unseen speakers is a critical capability. Thus, this study aims to improve the generalization of the system in order to enhance speaker verification performance. To achieve this goal, we propose a novel supervised learning method for speaker verification that uses the mean teacher framework. The mean teacher network is obtained by temporal averaging of deep neural network parameters; it can produce more accurate and stable representations than the fixed weights at the end of training and is conventionally used for semi-supervised learning. Leveraging the success of the mean teacher framework in many studies, the proposed supervised learning method exploits the mean teacher network as an auxiliary model for better training of the main model, the student network. By learning the reliable intermediate representations derived from the mean teacher network as well as one-hot speaker labels, the student network is encouraged to explore more discriminative embedding spaces. The experimental results demonstrate that the proposed method reduces the equal error rate by a relative 11.61% compared to the baseline system.
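The two ingredients described in the abstract, a teacher formed by temporal averaging of the student's parameters and a student objective that combines one-hot speaker labels with the teacher's intermediate representations, can be illustrated with a minimal sketch. This is not the authors' implementation: the exponential-moving-average form of the averaging, the `decay` value, the mean-squared-error consistency term, and the `weight` factor are all assumptions chosen for illustration, and parameters are plain Python lists rather than network tensors.

```python
import math


def ema_update(teacher, student, decay=0.999):
    """Temporal averaging: move each teacher parameter toward the student's.

    The EMA form and the decay value are assumptions, not taken from the paper.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def cross_entropy(logits, label):
    """Cross-entropy against a one-hot speaker label (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mse(a, b):
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(logits, label, student_feat, teacher_feat, weight=1.0):
    """Supervised loss plus a term pulling student features toward the teacher's.

    Using MSE as the feature-matching term is an illustrative assumption.
    """
    return cross_entropy(logits, label) + weight * mse(student_feat, teacher_feat)


# One illustrative step with hypothetical values.
teacher_params = ema_update([0.0, 0.0], [1.0, 2.0], decay=0.9)
loss = student_loss([2.0, 0.5], 0, [0.1, 0.3], [0.2, 0.1])
```

In a real training loop the teacher would receive no gradient updates of its own; it would only be refreshed by `ema_update` after each optimizer step on the student.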

Highlights

  • Speaker verification (SV) is the task of authenticating whether the speaker of an unknown input utterance matches the target speaker, and it is widely used in applications such as voice assistant systems [1,2]

  • The baseline system is a RawNet2-based model with several modifications, which reported improved performance in terms of equal error rate (EER) compared to the original RawNet2

  • This result indicates that the supervised mean teacher (MT) framework proposed in this study can improve the generalization of the SV system
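Since the comparisons above are stated in terms of EER and a relative reduction, a small sketch of how these two quantities are commonly computed may help. The function names, the threshold sweep over observed scores, and the example scores are hypothetical and are not taken from the paper; production evaluations typically interpolate the ROC curve rather than using this coarse sweep.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Returns the mean of the false acceptance and false rejection rates at
    the threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: target trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline, proposed):
    """Relative reduction in percent, e.g. of EER, with respect to a baseline."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical trial scores, for illustration only.
eer = equal_error_rate([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

A relative reduction compares the change to the baseline value, so it is distinct from the absolute difference in EER percentage points.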


Summary

Introduction

Academic Editor: Arcangelo

Speaker verification (SV) is the task of authenticating whether the speaker of an unknown input utterance matches the target speaker, and it is widely used in applications such as voice assistant systems [1,2]. Recent SV systems are primarily studied in an open-set scenario, in which testing uses utterances from speakers not seen in the training phase, and this requires strong generalization [2,3]. Considering these characteristics of SV, many researchers have aimed to extract discriminative speaker embeddings from utterances by exploiting deep neural networks (DNNs). We noted from the results of a study that simply averaging DNN parameters after each step in the training phase can converge to better local minima [4]. This technique is called “temporal averaging”; the temporally averaged weights can lead to more stable and accurate results than the final weights obtained when training has completed. The mean teacher (MT) is the temporal averaging model of the student network and can generate a relatively

