Abstract

In this paper, we propose a novel method to segment and label pass-phrase utterances for training deep neural network (DNN) bottleneck (BN) features for text-dependent speaker verification (TD-SV). Specifically, gender-dependent hidden Markov models (HMMs) for monophones are first trained using pass-phrase utterances that are disjoint from the evaluation data. Next, the trained HMMs are speaker-adapted and then used to segment and label these training utterances at the phone level. The resulting labeled data is subsequently used to train DNN models that discriminate gender-dependent phones, from which phone-discriminant BN features are extracted. This is in contrast to conventional approaches, which apply a general-purpose, speaker-independent automatic speech recognition (ASR) system to generate the segmentation and labels. The proposed method eliminates the need for a separate ASR system, which can have the additional disadvantage of being mismatched with the pass-phrase utterances in terms of language, dialect, domain, acoustic conditions and so on. Experiments are conducted on the RedDots challenge 2016 database for TD-SV using short utterances, with Gaussian mixture model-universal background model (GMM-UBM) and i-vector techniques. Experimental results demonstrate that the proposed method yields lower error rates in TD-SV than a set of existing methods. A thorough ablation study further confirms the effectiveness of the method. Fusion at both the score and feature levels also shows the complementary nature of the proposed features.
