Abstract
Phonetic information is one of the most essential components of a speech signal and plays an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems, since phonetic content resides primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods apply phonetic information only to frame-level trained speaker embeddings. To address this limitation, this paper proposes phonetic adaptation and hybrid multi-task learning, and further combines the two into the c-vector and simplified c-vector architectures. Experiments on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings outperform the baseline. The c-vector system performs best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.
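As a rough illustration of how hybrid multi-task learning bridges the frame/segment mismatch described above, the following is a minimal, hypothetical PyTorch sketch: one shared frame-level layer feeds both a frame-level phonetic classifier and a speaker branch that is pooled to the segment level. All layer sizes, names, and the statistics-pooling choice are illustrative assumptions, not the paper's exact c-vector architecture.

```python
import torch
import torch.nn as nn

class HybridMultiTaskNet(nn.Module):
    """Hypothetical sketch: one shared frame-level layer, two task branches."""
    def __init__(self, feat_dim=40, num_speakers=5000, num_phones=42):
        super().__init__()
        # Frame-level layer shared by both tasks.
        self.shared = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        # Phonetic branch: per-frame phone classification.
        self.phone_head = nn.Linear(512, num_phones)
        # Speaker branch: a further frame-level layer, then statistics pooling.
        self.spk_frames = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.embedding = nn.Linear(2 * 512, 256)  # mean+std stats -> embedding
        self.spk_head = nn.Linear(256, num_speakers)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h = self.shared(x)                 # shared frame-level representation
        phone_logits = self.phone_head(h)  # frame-level phonetic targets
        s = self.spk_frames(h)
        stats = torch.cat([s.mean(dim=1), s.std(dim=1)], dim=-1)  # pooling
        emb = self.embedding(stats)        # segment-level speaker embedding
        return self.spk_head(emb), phone_logits

model = HybridMultiTaskNet()
spk_logits, phone_logits = model(torch.randn(8, 200, 40))
# Training would combine a segment-level speaker loss with a weighted
# frame-level phonetic loss computed on these two outputs.
```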
Highlights
Automatic speaker verification (ASV) has made great strides in the last two decades, moving from traditional Gaussian mixture model (GMM) approaches [1] to the i-vector framework [2] and neural network-based speaker embedding [3].
We have examined the new speaker embeddings on the multi-lingual VoxCeleb dataset; it is still interesting to investigate the performance of our proposed methods on the more challenging National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2016 and 2018 datasets.
All reported results are obtained by setting the learning rate scaling factor to 0.2 in phonetic adaptation and sharing one layer in multi-task learning.
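One plausible reading of the learning rate scaling factor is sketched below, under the assumption that the factor rescales the learning rate of the layers transferred from the phonetic network: transferred layers adapt at 0.2 times the base rate while newly added speaker layers train at the full rate. The tiny stand-in model and all names here are hypothetical, for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model: "shared" is initialized from a phonetic
# network; "speaker" is a newly added speaker-discriminative layer.
model = nn.ModuleDict({
    "shared": nn.Linear(40, 512),
    "speaker": nn.Linear(512, 256),
})

base_lr = 1e-3
optimizer = torch.optim.SGD([
    # Phonetically initialized layers adapt with the lr scaled by 0.2.
    {"params": model["shared"].parameters(), "lr": base_lr * 0.2},
    # New speaker layers train at the full base learning rate.
    {"params": model["speaker"].parameters(), "lr": base_lr},
], momentum=0.9)
```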
Summary
Automatic speaker verification (ASV) has made great strides in the last two decades, moving from traditional Gaussian mixture model (GMM) approaches [1] to the i-vector framework [2] and neural network-based speaker embeddings [3]. Based on Bayesian factor analysis, the i-vector framework converts a variable-length speech utterance into a fixed-length vector representing speaker characteristics. A variety of backend classifiers can then be applied to suppress session variability and increase speaker discrimination. Even though the i-vector approach performed well in previous National Institute of Standards and Technology (NIST) speaker recognition evaluations (SREs), it has known limitations in practical applications. In particular, the i-vector is a point estimate of the total variability factor, ignoring the covariance of that estimate [2, 4].
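For reference, the point-estimate limitation can be seen in the classical total variability model from [2], written below in its standard form (not specific to this paper): the utterance-dependent GMM mean supervector M is modeled as an offset from the universal background model (UBM) mean supervector m.

```latex
M = m + T w, \qquad w \sim \mathcal{N}(0, I)
```

Here T is the low-rank total variability matrix, and the i-vector is the posterior-mean point estimate of the latent factor w; the posterior covariance of w is discarded, which is the limitation noted above.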