Speaker Embedding Research Articles

Speaker recognition remains an active research challenge, with expanding applications in commercial, forensic, educational, and general speech technology interfaces. However, difficulties persist, especially for naturalistic audio streams, including recordings with a mismatch between train and test data (i.e., when training or system development data and enrollment/test or application data are collected from different sources). Mismatch conditions (Hansen and Hasan, 2015) can be divided into two categories: extrinsic (channel, noise, etc.) and intrinsic (duration, language, and speaker traits including stress, emotion, Lombard effect, vocal effort, and accent). Here, we investigate speaker recognition under domain mismatch (an intrinsic mismatch), focusing on the challenges introduced by the NIST (National Institute of Standards and Technology) SRE (Speaker Recognition Evaluation) in 2016 and 2018. The challenges introduced in NIST SRE-16 and SRE-18 include language mismatch between the training data (used for system development) and the enrollment/test data (used at the application phase). We develop three alternative speaker embedding systems: i-vector, t-vector (an improved triplet loss solution), and x-vector. In addition, a number of unsupervised and supervised (using pseudo labels) methods are studied for domain mismatch compensation, applied mainly at the back-end level. These include adapted PLDA, adapted discriminant analysis, and score normalization and calibration methods using unlabeled in-domain data. We also propose new variations of discriminant analysis with support vectors (SVDA). The results confirm that SVDA measurably improves speaker recognition performance on the SRE-16 and SRE-18 tasks, by +15% and +8% respectively in terms of min-Cprimary and by +14% and +16% respectively in terms of EER, using i-vector speaker embeddings as the baseline. These advancements offer promising steps toward addressing speaker recognition in naturalistic audio streams.
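The equal error rate (EER) reported in the abstract above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how EER can be estimated from verification scores; it is not the authors' evaluation code. The function name `equal_error_rate` and the toy scores are assumptions made for illustration, and the threshold sweep is a simple approximation rather than the official NIST scoring procedure.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores for same-speaker trials (higher = more similar).
    nontarget_scores: scores for different-speaker trials.
    Hypothetical helper for illustration only.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a target trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one target trial overlaps the non-target range.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
```

With only a handful of trials the two error rates may never be exactly equal, which is why the sketch averages them at the closest point.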

Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems that can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is made possible by a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings. We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also look into the distances between natural and synthesised t-SNE projections of the embeddings computed by an accurate speaker verification network. The results show that there is indeed a strong correlation between the recording conditions and the speaker’s synthetic voice quality. Speaker gender does not influence the output, and extending the input text representation with syllable boundaries and lexical stress information does not enhance the generated audio equally across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts the neural representations of some speakers, but not all of them. As a result, the output speech for these speakers is of lower quality.
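The word error rate (WER) mentioned in the abstract above is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below illustrates that definition only; the function name `word_error_rate` and the example sentences are assumptions, and the abstract does not describe how the paper obtained its transcripts or which scoring tool it used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenisation is plain
    whitespace splitting, with no text normalisation.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.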

Articles published on Speaker Embedding

Joint speaker diarization and speech recognition based on region proposal networks

U-Vectors: Generating Clusterable Speaker Embedding from Unlabeled Data

Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning

A speaker verification backend with robust performance across conditions

H-VECTORS: Improving the robustness in utterance-level speaker embeddings using a hierarchical attention model

Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

End-to-end recurrent denoising autoencoder embeddings for speaker identification

Combination of deep speaker embeddings for diarisation

Text-To-Speech Synthesis Using Transfer Learning

Privacy‐preserving speaker verification system based on binary I‐vectors

Design of a Multi-Condition Emotional Speech Synthesizer

An investigation of domain adaptation in speaker embedding space for speaker recognition

Masked cross self-attentive encoding based speaker embedding for speaker verification

Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation

Monaural Speech Separation Using Speaker Embedding From Preliminary Separation

Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network

Xi-Vector Embedding for Speaker Recognition

Transfer Learning From Speech Synthesis to Voice Conversion With Non-Parallel Training Data

CTNet: Conversational Transformer Network for Emotion Recognition

An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis
