Abstract

A recent speaker recognition approach that performs well in noisy environments extracts the speaker embedding using an attention mechanism instead of average or statistics pooling. Within the attention approach, recognition performance improves when multiple heads are employed rather than a single head. In this paper, we propose advanced methods that extract a new embedding by compensating for the disadvantages of the single-head and multi-head attention methods. The combination of single-head and split-based multi-head attention achieves a 5.39% Equal Error Rate (EER). When the single-head and projection-based multi-head attention methods are combined, the system achieves a 4.45% EER, the best performance in this work. Our experimental results demonstrate that the attention mechanism reflects the speaker's properties more effectively than average or statistics pooling, and that the speaker verification system can be further improved by combining different attention techniques.
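To make the pooled embeddings under comparison concrete, below is a minimal PyTorch sketch of single-head attentive pooling and of a split-based multi-head variant. The layer sizes, the tanh scorer, and the class names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Single-head attentive pooling: each frame is weighted by a learned
    score before averaging, instead of plain average pooling.
    Hidden size and scorer are illustrative, not from the paper."""
    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h):                        # h: (batch, frames, feat_dim)
        w = torch.softmax(self.attn(h), dim=1)   # (batch, frames, 1)
        return torch.sum(w * h, dim=1)           # (batch, feat_dim)

class SplitMultiHeadAttentivePooling(nn.Module):
    """Split-based multi-head variant: the feature dimension is split into
    num_heads disjoint chunks, each pooled by its own attention head, and
    the pooled chunks are concatenated."""
    def __init__(self, feat_dim, num_heads=4, hidden_dim=64):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.heads = nn.ModuleList(
            AttentivePooling(feat_dim // num_heads, hidden_dim)
            for _ in range(num_heads)
        )

    def forward(self, h):                        # h: (batch, frames, feat_dim)
        chunks = h.chunk(self.num_heads, dim=-1)
        return torch.cat([head(c) for head, c in zip(self.heads, chunks)], dim=-1)
```

In a projection-based variant, each head would instead attend over its own learned projection of the full feature vector rather than over a disjoint slice.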

Highlights

  • The key to good speaker recognition systems lies in generating speaker features that can effectively distinguish different speakers

  • We evaluated the performance using the Equal Error Rate (EER) measure, which is typically used in speaker verification

  • The False Rejection Rate (FRR) is the percentage of incorrectly rejected true users; it is identical to the False Negative Rate (FNR). A minimal sketch of the EER computation follows this list
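
As an illustration of how the EER cited in these highlights can be obtained from a set of trial scores, here is a minimal NumPy sketch; the simple threshold sweep approximates the EER rather than using the interpolation of standard evaluation toolkits.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER: the operating point where the False Rejection
    Rate (FRR, identical to FNR) equals the False Acceptance Rate (FAR).

    scores -- similarity scores; higher means more likely same speaker
    labels -- 1 for target (same-speaker) trials, 0 for impostor trials
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    order = np.argsort(scores)              # sweep the threshold upward
    labels = labels[order]
    n_target = labels.sum()
    n_impostor = len(labels) - n_target
    # With the threshold just above the i-th sorted score, trials 0..i are rejected.
    frr = np.cumsum(labels) / n_target                # targets rejected so far
    far = 1.0 - np.cumsum(1 - labels) / n_impostor    # impostors still accepted
    i = np.argmin(np.abs(frr - far))        # point where the two rates cross
    return (frr[i] + far[i]) / 2.0

# Toy usage: perfectly separated scores give an EER of 0.0.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(equal_error_rate(scores, labels))     # -> 0.0
```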



Introduction

The key to good speaker recognition systems lies in generating speaker features that can effectively distinguish different speakers. Conventional speaker recognition systems used spectral representations such as linear predictive coefficients (LPC) or Mel-frequency cepstral coefficients (MFCC) as speaker features and Gaussian Mixture Models (GMM) for speaker modeling [1,2,3]. Studies using deep neural networks have been actively conducted since the mid-2010s. These studies mainly used speaker features extracted from neural sequence models such as the time-delay neural network (TDNN) and long short-term memory (LSTM) networks. Acoustic features such as MFCCs or Mel-filter bank outputs are used as input to the deep neural model, and fully connected layers are usually added on top. Average or statistics pooling is applied at the output stage of the model to convert the frame-level representation into an utterance-level representation, and the resulting embedding is used as the speaker feature [5,6,7,8,9].
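
To illustrate the frame-level to utterance-level conversion described above, here is a minimal PyTorch sketch of statistics pooling as commonly implemented in x-vector style systems; the eps safeguard is a common numerical convention, not a detail from the text.

```python
import torch

def statistics_pooling(h, eps=1e-8):
    """Pool frame-level features (batch, frames, dim) into a single
    utterance-level vector by concatenating the per-dimension mean and
    standard deviation over the frame axis."""
    mean = h.mean(dim=1)
    std = torch.sqrt(h.var(dim=1, unbiased=False) + eps)  # eps avoids sqrt(0)
    return torch.cat([mean, std], dim=1)                  # (batch, 2 * dim)

# Toy usage: 200 frames of 512-dim features -> one 1024-dim utterance vector.
frames = torch.randn(1, 200, 512)
print(statistics_pooling(frames).shape)                   # torch.Size([1, 1024])
```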
