Abstract

Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods aggregate the extracted features across time only and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that contribute more information to the utterance-level representation. In addition, two more effective temporal-frequency aggregation methods are proposed by combining the frequency aggregation with existing temporal aggregation methods. The two proposed methods capture the speaker-dependent information contained in both the time domain and the frequency domain of the frame-level features, thus improving the discriminability of the speaker embedding. Furthermore, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves an equal error rate (EER) of 5.96% on VoxCeleb, outperforming state-of-the-art baseline models.
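For reference, the existing temporal aggregation methods the abstract refers to are typically variants of self-attentive pooling over frames. Below is a minimal generic sketch in PyTorch; the class name, hidden size, and tensor shapes are illustrative assumptions, not this paper's code.

```python
import torch
import torch.nn as nn

class TemporalSAP(nn.Module):
    """Generic temporal self-attentive pooling over frames: each frame
    gets a learned importance weight, and the weighted mean of the
    frame-level features is the utterance-level representation."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        # Small MLP that scores each frame from its feature vector.
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, h):                        # h: (batch, frames, dim)
        w = torch.softmax(self.score(h), dim=1)  # per-frame weights, sum to 1
        return (w * h).sum(dim=1)                # (batch, dim)

# Example: 8 utterances, 200 frames of 256-dim CNN features each.
emb = TemporalSAP(dim=256)(torch.randn(8, 200, 256))
print(emb.shape)  # torch.Size([8, 256])
```

Note that this kind of pooling attends only along the time axis, which is exactly the limitation the proposed frequency aggregation addresses.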

Highlights

  • Speaker verification (SV) is a voice biometric authentication technology developed to judge the claimed identity of a test speaker

  • A novel shared-parameter grouped frequency self-attentive pooling (SGFSAP) layer is proposed to effectively capture the speaker-dependent information contained in the frequency domain, based on the following facts: (1) the speaker-dependent information is distributed across both the time domain and the frequency domain of the 2D frame-level features generated by the convolutional neural network (CNN) frontend; (2) the individual information is encoded non-uniformly across the different frequency bands of an utterance [28]; (3) some speaker-dependent frequency information varies with the phonetic content of the utterance [26,28,29] (see the sketch after this list)

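Based only on the description above, a shared-parameter grouped frequency self-attentive pooling layer might look roughly like the following PyTorch sketch. This is a hypothetical reconstruction: the grouping scheme, the time-averaging step, the attention network, and all shapes are assumptions, since the paper's exact formulation is not shown in this excerpt.

```python
import torch
import torch.nn as nn

class SGFSAP(nn.Module):
    """Hypothetical sketch of shared-parameter grouped frequency
    self-attentive pooling: the F frequency bins of the CNN feature
    map are split into G groups, and one attention network (shared
    across groups, hence "shared-parameter") weights the bins inside
    each group before pooling them."""
    def __init__(self, channels, num_groups, hidden=64):
        super().__init__()
        self.num_groups = num_groups
        # Single attention MLP reused for every group (shared parameters).
        self.score = nn.Sequential(
            nn.Linear(channels, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):                # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        g = self.num_groups              # assumes f is divisible by g
        x = x.mean(dim=3)                # average over time -> (b, c, f)
        x = x.transpose(1, 2).reshape(b, g, f // g, c)  # group freq bins
        w = torch.softmax(self.score(x), dim=2)         # weights per bin
        pooled = (w * x).sum(dim=2)                     # pool within group
        return pooled.flatten(1)         # (b, g * c) utterance-level vector

# Example: features with 256 channels, 16 frequency bins, 200 frames.
emb = SGFSAP(channels=256, num_groups=4)(torch.randn(8, 256, 16, 200))
print(emb.shape)  # torch.Size([8, 1024])
```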

Introduction

Speaker verification (SV) is a voice biometric authentication technology developed to judge the claimed identity of a test speaker. The framework composed of the i-vector [6] and probabilistic linear discriminant analysis (PLDA) [7] has dominated text-independent SV because of its superior performance, simplicity, and efficiency. In this framework, a Gaussian mixture model-universal background model (GMM-UBM) [8] is first used to collect sufficient statistics. Although the i-vector/PLDA system achieves great success in some scenarios, its performance degrades when enrollment/test utterance durations are short [9,10].
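For context, the i-vector model mentioned above is a standard formulation from the literature rather than this paper's contribution: the speaker- and channel-dependent GMM mean supervector is modeled as M = m + Tw, where m is the UBM mean supervector, T is a low-rank total variability matrix, and the i-vector w is the posterior mean of the latent factor given the Baum-Welch sufficient statistics collected with the GMM-UBM. A PLDA backend then scores the similarity between pairs of i-vectors.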
