Short-Time Spectral Aggregation for Speaker Embedding

Youzhi Tu, Man-Wai Mak

doi:10.1109/icassp39728.2021.9414094

Abstract

State-of-the-art speaker verification systems take frame-level acoustic features as input and produce fixed-dimensional embeddings as utterance-level representations. How information is aggregated from the frame-level features is therefore vital for achieving high performance. This paper introduces short-time spectral pooling (STSP) for better aggregation of frame-level information. STSP transforms the temporal feature maps of a speaker embedding network into the spectral domain and extracts the lowest spectral components of the averaged spectrograms for aggregation. Benefiting from the low-pass characteristic of the averaged spectrograms, STSP is able to preserve most of the speaker information in the feature maps using only a few spectral components. We show that statistics pooling is a special case of STSP in which only the DC spectral components are used. Experiments on VoxCeleb1 and VOiCES 2019 show that STSP outperforms statistics pooling and multi-head attentive pooling, which suggests that leveraging more spectral information in the CNN feature maps can produce highly discriminative speaker embeddings.
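
The pooling idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `short_time_spectral_pooling`, the rectangular window, the parameters `win_len`, `hop`, and `num_components`, and the use of magnitude spectra only are all hypothetical choices made for this example. The final lines check only the mean part of the "special case" claim: with a single window spanning the whole sequence and only the DC component kept, the result (divided by the number of frames) matches mean pooling for non-negative feature maps.

```python
import numpy as np


def short_time_spectral_pooling(feature_map, win_len, hop, num_components):
    """Sketch of spectral-domain pooling (hypothetical helper, not the paper's code).

    feature_map: array of shape (channels, frames), e.g. the last CNN layer output.
    Returns a vector of length channels * num_components.
    """
    channels, frames = feature_map.shape
    if frames < win_len:
        raise ValueError("sequence is shorter than the analysis window")
    if num_components > win_len // 2 + 1:
        raise ValueError("num_components exceeds the number of rfft bins")

    # Cut each channel into (possibly overlapping) segments with a
    # rectangular window: shape (channels, num_segments, win_len).
    starts = range(0, frames - win_len + 1, hop)
    segments = np.stack([feature_map[:, s:s + win_len] for s in starts], axis=1)

    # Magnitude spectrum of every segment, then average over segments to get
    # one averaged spectrum per channel.
    spectra = np.abs(np.fft.rfft(segments, axis=-1))
    averaged = spectra.mean(axis=1)

    # Keep only the lowest-frequency components of each channel and flatten.
    return averaged[:, :num_components].reshape(-1)


# Toy input: 4 channels, 200 frames, non-negative as after a ReLU (assumption).
rng = np.random.default_rng(0)
feature_map = np.abs(rng.standard_normal((4, 200)))

# Short windows, three lowest components per channel.
embedding = short_time_spectral_pooling(feature_map, win_len=50, hop=25, num_components=3)

# One window over the whole sequence, DC component only: this reduces to
# mean pooling up to the scale factor `frames`.
dc_only = short_time_spectral_pooling(feature_map, win_len=200, hop=200, num_components=1) / 200
mean_pooled = feature_map.mean(axis=1)
```

Here `embedding` carries more than the per-channel average because it also keeps the first non-DC components, which is the extra spectral information the abstract refers to. The standard-deviation part of statistics pooling is not covered by this sketch.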

Similar Papers

Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding
Youzhi Tu ... Man-Wai Mak
IEEE/ACM Transactions on Audio, Speech, and Language Processing | VOL. 30
01 Jan 2021

A Parameter-Free Pixel Correlation-Based Attention Module for Remote Sensing Object Detection
Xin Guan ... Yifan Dong
Remote Sensing | VOL. 16
12 Jan 2024

SAR Ship Detection Based on Swin Transformer and Feature Enhancement Feature Pyramid Network
Xiao Ke ... Tianwen Zhang
17 Jul 2022

CorrDrop: Correlation Based Dropout for Convolutional Neural Networks
Yuyuan Zeng ... Shu-Tao Xia
01 May 2020
