Abstract
One of the most recent speaker recognition methods that demonstrates outstanding performance in noisy environments extracts the speaker embedding using an attention mechanism instead of average or statistics pooling. Within the attention approach, speaker recognition performance is further improved by employing multiple heads rather than a single head. In this paper, we propose advanced methods to extract a new embedding by compensating for the disadvantages of the single-head and multi-head attention methods. The combination method comprising single-head and split-based multi-head attentions shows a 5.39% Equal Error Rate (EER). When the single-head and projection-based multi-head attention methods are combined, the speaker recognition performance improves to a 4.45% EER, the best performance in this work. Our experimental results demonstrate that the attention mechanism reflects the speaker's properties more effectively than average or statistics pooling, and that the speaker verification system can be further improved by combining different attention techniques.
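To make the pooling comparison concrete, here is a minimal sketch of single-head attentive pooling: each frame is scored, the scores are normalized with a softmax, and the utterance embedding is the resulting weighted average of the frames. The attention parameter `w` is a hypothetical learned weight vector introduced for illustration; it is not taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pooling(frames, w):
    """Single-head attentive pooling: score each frame-level feature,
    normalize the scores with softmax, and return the weighted average
    of the frames as the utterance-level embedding."""
    scores = frames @ w          # one scalar score per frame, shape (T,)
    alpha = softmax(scores)      # attention weights, sum to 1
    return alpha @ frames        # utterance embedding, shape (D,)

rng = np.random.default_rng(0)
T, D = 100, 64                   # 100 frames of 64-dim features (illustrative sizes)
frames = rng.standard_normal((T, D))
w = rng.standard_normal(D)       # hypothetical learned attention parameter
emb = attention_pooling(frames, w)
```

Unlike average pooling, which weights every frame equally, the learned weights let the model emphasize frames that carry more speaker-discriminative information.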
Highlights
The key to good speaker recognition systems lies in generating speaker features that can effectively distinguish different speakers
We evaluated the performance using the Equal Error Rate (EER) measure, which is typically used in speaker verification
The False Rejection Rate (FRR) is the percentage of true users who are incorrectly rejected; it is identical to the False Negative Rate (FNR)
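The EER mentioned in the highlights is the operating point where the FRR equals the False Acceptance Rate (FAR). A minimal sketch of how it can be approximated from verification scores (the scores and threshold sweep below are illustrative, not from the paper):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Approximate the Equal Error Rate: sweep candidate thresholds and
    find the point where the False Rejection Rate (FRR, true users rejected)
    is closest to the False Acceptance Rate (FAR, impostors accepted)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(frr - far))   # threshold where the two rates cross
    return (frr[idx] + far[idx]) / 2.0

# Toy similarity scores: higher means "same speaker"
targets = np.array([0.9, 0.8, 0.7, 0.6, 0.3])
impostors = np.array([0.5, 0.4, 0.2, 0.1, 0.05])
eer = compute_eer(targets, impostors)    # 0.2 for these toy scores
```

A lower EER means the system better separates true-speaker trials from impostor trials.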
Summary
The key to good speaker recognition systems lies in generating speaker features that can effectively distinguish different speakers. Conventional speaker recognition systems used spectral representations such as linear predictive coefficients (LPC) or Mel-frequency cepstral coefficients (MFCC) as speaker features and a Gaussian Mixture Model (GMM) for speaker modeling [1,2,3]. Studies using deep neural networks have been actively conducted since the mid-2010s. These studies mainly used speaker features extracted from models such as the time-delay neural network (TDNN) and long short-term memory (LSTM). Acoustic features such as MFCCs or Mel-filter bank outputs are used as input to the deep neural model, and fully connected layers are usually added to the model. Average or statistics pooling is applied at the model output stage to convert the frame-level representation into an utterance-level representation, and the resulting embedding is used as the speaker feature [5,6,7,8,9].
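The frame-to-utterance conversion described above can be sketched as follows. This is a minimal illustration of statistics pooling (mean and standard deviation concatenated per feature dimension); the array shapes are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def statistics_pooling(frames):
    """Convert frame-level features of shape (T, D) into a single
    utterance-level vector by concatenating the per-dimension mean
    and standard deviation, giving a (2*D,) embedding."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 512))  # e.g. 200 frames of 512-dim features
embedding = statistics_pooling(frames)    # shape (1024,)
```

Average pooling keeps only the mean; statistics pooling adds the standard deviation, so the embedding also captures how much the features vary across the utterance.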