Abstract

Introducing deep neural networks into conventional speaker recognition pipelines has been broadly shown to benefit system performance. A novel text-independent speaker verification (SV) framework based on the triplet loss and a very deep convolutional neural network architecture (i.e., Inception-ResNet-v1) is investigated in this study, where a fixed-length speaker-discriminative embedding is learned from sparse speech features and utilized as a feature representation for SV tasks. A concise description of the neural-network-based speaker-discriminative training with triplet loss is presented. A Euclidean distance similarity metric is applied in both network training and SV testing, which ensures that the SV system operates in an end-to-end fashion. By replacing the final max/average pooling layer with a spatial pyramid pooling layer in the Inception-ResNet-v1 architecture, the fixed-length input constraint is relaxed and a clear performance gain is achieved compared with the fixed-length-input speaker embedding system. For datasets with more severe training/test condition mismatches, a probabilistic linear discriminant analysis (PLDA) back end is further introduced to replace the distance-based scoring for the proposed speaker embedding system. Thus, we reconstruct the SV task with a neural-network-based front-end speaker embedding system and a PLDA back end that compensates for channel and noise variability. Extensive experiments are conducted to provide insights that lead to better test performance. Comparisons with state-of-the-art SV frameworks on three public datasets (i.e., a prompt speech corpus, the conversational Switchboard corpus, and the NIST SRE10 10 s–10 s condition) confirm the effectiveness of our proposed speaker embedding system.
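As a minimal sketch of the triplet-loss objective with Euclidean distance referred to in the abstract (the margin value, embedding dimensionality, and L2 normalization below are illustrative assumptions, not settings taken from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss with squared Euclidean distance.

    Encourages the anchor-positive (same-speaker) distance to be smaller
    than the anchor-negative (different-speaker) distance by at least
    `margin`. The margin of 0.2 is an assumed value for illustration.
    """
    d_ap = np.sum((anchor - positive) ** 2)   # same-speaker distance
    d_an = np.sum((anchor - negative) ** 2)   # different-speaker distance
    return max(d_ap - d_an + margin, 0.0)

# Toy usage with hypothetical 128-dimensional, L2-normalized embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(triplet_loss(emb[0], emb[1], emb[2]))
```

At verification time, the same Euclidean distance between enrollment and test embeddings can serve directly as the scoring metric, which is what allows the system to remain end-to-end when no PLDA back end is used.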
