Abstract
Recent studies show that employing an attention mechanism improves speaker verification performance compared with temporal and statistics pooling techniques. This paper proposes an advanced multi-head attention method that uses a sorted vector of the frame-level features to capture a higher correlation between them. We also propose a transfer learning scheme that maximizes the effectiveness of two loss functions, the classifier-based cross-entropy loss and the metric-based GE2E loss, for learning the distance between embeddings. The sorted multi-head attention (SMHA) method outperforms conventional attention methods, achieving an equal error rate (EER) of 4.55%. The proposed transfer learning scheme with the Class-GE2E loss function significantly improved our attention-based systems. In particular, the EER of SMHA decreased to 4.39% when transfer learning with the Class-GE2E loss was employed. The experimental results demonstrate that including a greater correlation between frame-level features in multi-head attention processing, and combining two different loss functions through transfer learning, are highly effective for improving speaker verification performance.
Highlights
Speaker verification determines whether a speaker is registered in the system.
The overall process is the same as that of single-head attention; the largest difference is that the frame-level features are split into as many parts as there are heads, and each part passes through the attention layer, as expressed in (3).
We propose a sorted multi-head attention that generates sub-embeddings by dividing the ordered values of the frame-level features, so that the correlation between features is considered while the weights are computed.
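The two highlights above can be illustrated with a minimal sketch of multi-head attention pooling, with an optional sorting step. It is not the authors' formulation: equation (3) is not reproduced here, so the dot-product scoring, the hypothetical `head_weights` parameter, and the choice to sort the values inside each frame vector before splitting are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, weight_vector):
    """Single-head attention pooling: a weighted sum of frames.

    Assumption: each frame is scored by a dot product with a
    weight vector; the paper's actual scoring function may differ.
    """
    scores = [sum(w * x for w, x in zip(weight_vector, frame)) for frame in frames]
    alphas = softmax(scores)  # one weight per frame, summing to 1
    dim = len(frames[0])
    return [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]


def multi_head_attention_pool(frames, head_weights, sort_features=False):
    """Split each frame into one chunk per head and pool each chunk.

    head_weights: one (hypothetical) weight vector per head.
    sort_features: if True, sort the values of every frame vector
    before splitting (an assumed reading of "ordered values").
    """
    num_heads = len(head_weights)
    dim = len(frames[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by the number of heads")
    chunk = dim // num_heads
    if sort_features:
        frames = [sorted(frame) for frame in frames]
    embedding = []
    for h, weights in enumerate(head_weights):
        # Each head sees only its own slice of every frame.
        sub_frames = [frame[h * chunk:(h + 1) * chunk] for frame in frames]
        # The head's pooled output is one sub-embedding.
        embedding.extend(attention_pool(sub_frames, weights))
    return embedding
```

Under these assumptions, setting `sort_features=True` places values of similar magnitude in the same head, which is one way to read the claim that sorting lets each head consider more strongly correlated features. The concatenated output has the same length as a single frame, regardless of how many frames the utterance contains.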
Summary
Speaker verification determines whether a speaker is registered in the system. In neural machine translation (NMT), an attention mechanism was introduced to assign large weights to the features that are useful for generating new domain features. In neural-network-based speaker recognition systems, an important frame-level representation is captured for each speaker's utterance to generate a speaker embedding with a fixed length. Temporal pooling [5], also known as average pooling, averages the frame-level representations extracted from neural networks along the time axis, and statistics pooling [6] calculates their average and standard deviation. Both methods generate a speaker embedding with a fixed length. The triplet loss function [5] has been proposed to learn characteristics that better distinguish a speaker from other speakers. We propose a neural model that uses classifier-based loss and GE2E loss functions together in transfer learning to obtain more effective speaker characteristics and embeddings for distinguishing speakers.
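The temporal pooling and statistics pooling described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation from [5] or [6]; the function names are hypothetical, and the use of the population standard deviation (dividing by the number of frames) is an assumption.

```python
import math


def temporal_pooling(frames):
    """Average the frame-level vectors along the time axis."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation.

    Assumption: population standard deviation (divide by the
    number of frames); some toolkits use the unbiased estimate.
    """
    num_frames = len(frames)
    mean = temporal_pooling(frames)
    std = [
        math.sqrt(sum((frame[d] - mean[d]) ** 2 for frame in frames) / num_frames)
        for d in range(len(mean))
    ]
    return mean + std
```

Both functions return a vector whose length depends only on the feature dimension, not on the number of frames, which is how utterances of different durations map to a fixed-length speaker embedding. Every frame contributes equally here, whereas an attention mechanism assigns each frame its own weight.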