Abstract

Recent studies show that speaker verification performance improves when an attention mechanism is employed in place of temporal or statistics pooling techniques. This paper proposes an advanced multi-head attention method that applies attention to a sorted vector of the frame-level features in order to exploit a higher correlation between them. We also propose a transfer learning scheme that maximizes the effectiveness of two loss functions, the classifier-based cross-entropy loss and the metric-based GE2E loss, for learning the distance between embeddings. The sorted multi-head attention (SMHA) method outperforms conventional attention methods, achieving an equal error rate (EER) of 4.55%. The proposed transfer learning scheme with the Class-GE2E loss function significantly improved our attention-based systems; in particular, the EER of the SMHA system decreased to 4.39% when transfer learning with the Class-GE2E loss was employed. The experimental results demonstrate that incorporating greater correlation between frame-level features into multi-head attention processing, and combining two different loss functions through transfer learning, are highly effective for improving speaker verification performance.

Highlights

  • Speaker verification determines whether a speaker is registered in the system

  • The overall process is the same as that of single-head attention; the largest difference is that the frame-level features are divided by the number of heads, with each part passing through its own attention layer, as expressed in (3) (see the sketch after this list)

  • We propose a sorted multi-head attention that generates sub-embeddings by dividing the sorted values of the frame-level features among the heads, so that the correlation between the features is considered while computing the attention weights; the sketch below includes this sorted variant
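
Below is a minimal PyTorch sketch of this pooling layer, written under our own assumptions about shapes and layer names (MultiHeadAttentivePooling and its per-head scoring layers are illustrative, not the authors' implementation). Multi-head attentive pooling splits each frame-level feature vector across the heads; the sorted variant (SMHA) first orders the values of each frame vector so that every head receives a contiguous range of sorted, and hence more correlated, values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentivePooling(nn.Module):
    """Pools frame-level features (B, T, D) into an utterance-level
    embedding (B, D), with one attention layer per head."""
    def __init__(self, feat_dim: int, num_heads: int, sorted_inputs: bool = False):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.head_dim = feat_dim // num_heads
        self.sorted_inputs = sorted_inputs  # True -> sorted multi-head attention (SMHA)
        # One scoring layer per head, applied only to that head's slice.
        self.score = nn.ModuleList(
            nn.Linear(self.head_dim, 1, bias=False) for _ in range(num_heads)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, D) frame-level features
        if self.sorted_inputs:
            # SMHA: sort each frame vector along the feature axis, so each
            # head sees a contiguous range of ordered (correlated) values.
            h, _ = torch.sort(h, dim=-1)
        sub_embeddings = []
        for head, chunk in zip(self.score, h.split(self.head_dim, dim=-1)):
            alpha = F.softmax(head(chunk), dim=1)          # (B, T, 1) frame weights
            sub_embeddings.append((alpha * chunk).sum(1))  # (B, D/H) sub-embedding
        return torch.cat(sub_embeddings, dim=-1)           # (B, D)

Setting num_heads=1 with sorted_inputs=False reduces this layer to single-head attentive pooling.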

Summary

Introduction

Speaker verification determines whether a speaker is registered in the system. In neural machine translation (NMT), an attention mechanism was introduced to assign large weights to the features that are useful for generating new domain features. In neural-network-based speaker recognition systems, for each speaker's utterance, important frame-level representations are captured to generate a fixed-length speaker embedding. Temporal pooling [5], also known as average pooling, averages the frame-level representations extracted from the neural network along the time axis, while statistics pooling [6] computes both their mean and standard deviation; both methods generate a fixed-length speaker embedding. The triplet loss function [5] has been proposed to learn characteristics that better distinguish one speaker from others. We propose a neural model that uses a classifier-based loss and the GE2E loss function together in a transfer learning scheme to obtain speaker embeddings that are more effective for distinguishing speakers.
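
As a concrete illustration of these two pooling baselines, here is a minimal sketch assuming frame-level features of shape (batch, time, dim); temporal pooling keeps only the time-axis mean, while statistics pooling concatenates the mean and standard deviation.

import torch

def temporal_pooling(h: torch.Tensor) -> torch.Tensor:
    # Average the frame-level representations along the time axis: (B, T, D) -> (B, D).
    return h.mean(dim=1)

def statistics_pooling(h: torch.Tensor) -> torch.Tensor:
    # Concatenate the time-axis mean and standard deviation: (B, T, D) -> (B, 2D).
    return torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)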
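The combined objective can be sketched as follows. This is only one plausible reading of the combination (a weighted sum of classifier cross-entropy and GE2E, here called class_ge2e_loss with an assumed weight lam), not the authors' exact formulation; the GE2E term follows the softmax form of Wan et al., scoring each utterance embedding against per-speaker centroids.

import torch
import torch.nn.functional as F

def ge2e_loss(emb: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Softmax GE2E loss. emb: (N, M, D) = N speakers x M utterances of
    L2-normalized embeddings; w, b: learnable scalar scale and bias."""
    N, M, _ = emb.shape
    centroids = emb.mean(dim=1)  # (N, D) per-speaker centroids
    # Centroid of each speaker excluding the utterance itself, used for the
    # positive term to avoid trivial self-similarity.
    excl = (emb.sum(dim=1, keepdim=True) - emb) / (M - 1)  # (N, M, D)
    # Cosine similarity of every utterance to every speaker centroid: (N, M, N).
    sim = F.cosine_similarity(emb.unsqueeze(2), centroids[None, None], dim=-1)
    # Swap in the exclusive centroid for the own-speaker entries.
    idx = torch.arange(N)
    sim[idx, :, idx] = F.cosine_similarity(emb, excl, dim=-1)
    sim = w * sim + b
    # Each utterance should be most similar to its own speaker's centroid.
    labels = idx.unsqueeze(1).expand(N, M).reshape(-1)
    return F.cross_entropy(sim.reshape(N * M, N), labels)

def class_ge2e_loss(logits, targets, emb, w, b, lam: float = 0.5):
    # Assumed combination: classifier cross-entropy plus a weighted GE2E term.
    return F.cross_entropy(logits, targets) + lam * ge2e_loss(emb, w, b)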

Related Works
Single-Head Attention (SHA) Layer
Multi-Head Attention (MHA) Layer
Generalized End-to-End Loss
Sorted Multi-Head Attention (SMHA) Layer
Transfer Learning
Dataset
Model Architecture
Findings
Conclusions
