Speaker Verification Models Research Articles

This research presents an extensive comparative analysis of four popular deep speaker embedding models, namely WavLM, TitaNet, ECAPA, and PyAnnote, applied to speaker verification. The study uses a specially curated dataset designed to mirror the real-world operating conditions of voice models as closely as possible. The dataset consists of short, non-English statements gathered from interviews on a popular online video platform. It covers 50 unique speakers (33 male and 17 female) aged 20 to 70, a variety that helps in thoroughly testing speaker verification models. Each speaker contributes 10 clips, each no longer than 10 s, for 500 recordings in total. The total length of all recordings is about 1 h and 30 min, which averages to roughly 100 s per speaker. This makes the dataset especially useful for research on speaker verification with short recordings. Performance is evaluated using common biometric metrics: false acceptance rate (FAR), false rejection rate (FRR), equal error rate (EER), and detection cost function (DCF). The TitaNet and ECAPA models stand out with the lowest EER (1.91% and 1.71%, respectively) and thus exhibit more discriminative features, reducing the intra-class distance (embeddings of the same speaker) while maximizing the distance between embeddings of different speakers. The analysis also highlights the ECAPA model's advantageous balance of performance and efficiency: its inference time of 69.43 milliseconds is slightly longer than that of the PyAnnote models.
This study not only compares model performance but also analyzes the respective model embeddings, offering insights into their strengths and weaknesses. The findings serve as a foundation for guiding future research in speaker verification, especially in the context of short audio samples or limited data. This may be particularly relevant for applications requiring quick and accurate speaker identification from short voice clips.
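The metrics named in the abstract above can be illustrated with a minimal sketch. It assumes that each verification trial yields a similarity score and that a trial is accepted when the score reaches a threshold. The function names, the toy scores, and the cost parameters are hypothetical and do not come from the study; the DCF shown is the commonly used weighted sum of miss and false-alarm rates, which may differ from the exact variant the authors applied.

```python
from typing import List, Tuple


def far_frr(genuine: List[float], impostor: List[float],
            threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) when a trial is accepted if score >= threshold."""
    # FAR: share of impostor (different-speaker) trials wrongly accepted.
    far = sum(s >= threshold for s in impostor) / len(impostor)
    # FRR: share of genuine (same-speaker) trials wrongly rejected.
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine: List[float],
                     impostor: List[float]) -> Tuple[float, float]:
    """Approximate the EER by trying every observed score as a threshold.

    Returns (eer, threshold) at the point where |FAR - FRR| is smallest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def detection_cost(far: float, frr: float, p_target: float = 0.01,
                   c_miss: float = 1.0, c_fa: float = 1.0) -> float:
    """Weighted detection cost; the default weights are assumptions."""
    return c_miss * frr * p_target + c_fa * far * (1.0 - p_target)


# Toy scores, invented for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]    # same-speaker trials
impostor = [0.6, 0.3, 0.2, 0.1]   # different-speaker trials

eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {thr}")  # EER = 25.00% at threshold 0.6
```

A lower EER means the genuine and impostor score distributions overlap less, which is the sense in which the abstract calls the TitaNet and ECAPA embeddings more discriminative.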

Voice conversion (VC) is a method of converting the source speaker's speech into the target speaker's speech without changing its content. Current VC methods have the following problems: (1) they apply only to a limited number of speakers rather than to arbitrary speakers, which greatly restricts their application scenarios; (2) the representation separation (RS) achieved by current mainstream techniques between the source speaker's speech and the target speaker's speech is not ideal; and (3) the conversion quality of most models is unsatisfactory and needs to be improved. Therefore, in this paper, we construct a one-shot VC model based on representation separation, called the RS-VC model, implemented with an encoder-decoder structure. The encoder is composed of a content encoder and a speaker encoder. The content encoder separates the content information from the source speaker's voice and generates a content representation. The speaker encoder separates the speaker information from the target speaker's voice and generates a speaker representation. The decoder combines the content representation and the speaker representation to generate the converted voice. We obtain the optimized speaker verification model SVINGE2E (Speaker Verification with Instance Normalization using Generalized End-to-End loss) by improving a speaker verification (SV) model, and use it as the speaker encoder. This speaker encoder is trained before the RS-VC model; the pre-trained SVINGE2E model directly extracts the speaker representation of the target speaker's voice and is used for training and testing the RS-VC model. A progressive training method is then proposed for training the RS-VC model. Experiments show that the progressive training method effectively improves the quality of the converted voice.
Compared with the basic speaker verification model, both SVINGE2E and RS-VC deliver impressive improvements in equal error rate (EER).
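The abstract names instance normalization as one ingredient of SVINGE2E but does not say where it is applied. As a rough illustration of the operation itself, and not of the authors' architecture, the sketch below normalizes each feature channel of a single utterance over time. The function name `instance_norm`, the `eps` value, and the toy feature matrix are assumptions made for this example.

```python
import math
from typing import List


def instance_norm(features: List[List[float]],
                  eps: float = 1e-5) -> List[List[float]]:
    """Normalize one utterance, laid out as features[channel][frame].

    Each channel is shifted and scaled with its OWN mean and variance,
    computed over the time axis of this utterance only (no batch statistics).
    """
    normalized = []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([(x - mean) * scale for x in channel])
    return normalized


# Toy input: 2 channels x 3 frames, invented for illustration.
utterance = [[1.0, 2.0, 3.0],
             [10.0, 10.0, 40.0]]

for row in instance_norm(utterance):
    print([round(v, 3) for v in row])
```

After normalization every channel has approximately zero mean and unit variance regardless of its original offset and scale, so utterance-level statistics are removed from the features that follow.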

Articles published on Speaker Verification Models

Comparison of Modern Deep Learning Models for Speaker Verification

A Lightweight CNN-Conformer Model for Automatic Speaker Verification

DR-SASV: A deep and reliable spoof aware speech verification system

Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Two-Tier Feature Extraction with Metaheuristics-Based Automated Forensic Speaker Verification Model

Collaborative and adversarial network for text‐independent speaker verification in domain adaptation

Lambda-vector modeling temporal and channel interactions for text-independent speaker verification

SV - VLSP2021: The Smartcall - ITS’s Systems

End-to-End Speaker Verification via Curriculum Bipartite Ranking Weighted Binary Cross-Entropy

Attentional triplet neural networks for text-dependent speaker verification

One-Shot Voice Conversion Algorithm Based on Representations Separation

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

A lighten CNN-LSTM model for speaker verification on embedded devices

Integrating DNN–HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification

Robust Speaker Identification and Verification in Adverse Acoustic Condition

Generalized Variability Model for Speaker Verification

Hybridized estimations of support vector machine free parameters C and γ using a fuzzy learning strategy for microphone array-based speaker recognition in a Kinect sensor-deployed environment

A General Bayesian Model for Speaker Verification

Speaker Verification via Modeling Kurtosis Using Sparse Coding

A nonlinear autoregressive model for speaker verification
