Abstract

This paper proposes using a previously well-trained deep neural network (DNN) to enhance the i-vector representation used for speaker diarization. In effect, we replace the Gaussian mixture model typically used to train a universal background model (UBM) with a DNN that has been trained on a different large-scale dataset. To train the T-matrix, instead of a traditional unsupervised UBM derived from a single feature stream, we use a supervised UBM: the DNN computes posterior information from filterbank input features, and MFCC features are then used to train the UBM. Next, we jointly use DNN and MFCC features to calculate the zeroth- and first-order Baum-Welch statistics for training an extractor from which we obtain the i-vector. The system will be shown to achieve a significant improvement on the NIST 2008 speaker recognition evaluation telephone data task compared to state-of-the-art approaches.
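The zeroth- and first-order Baum-Welch statistics mentioned above pair the DNN's frame-level posteriors with MFCC features. A minimal NumPy sketch of that accumulation, assuming a `(frames, components)` posterior matrix; the function name and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def baum_welch_stats(posteriors, mfcc):
    """Accumulate zeroth- and first-order Baum-Welch statistics.

    posteriors: (T, C) frame-level component posteriors (here, from the DNN)
    mfcc:       (T, D) MFCC feature vectors for the same T frames
    """
    # Zeroth-order: soft frame counts per component, N_c = sum_t gamma_t(c)
    N = posteriors.sum(axis=0)        # shape (C,)
    # First-order: posterior-weighted feature sums, F_c = sum_t gamma_t(c) * x_t
    F = posteriors.T @ mfcc           # shape (C, D)
    return N, F
```

The only change from a conventional UBM/i-vector pipeline is the source of `posteriors`: they come from the DNN rather than from GMM component likelihoods.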

Highlights

  • Speaker diarization is a technology used to solve the problem of “who spoke what and when did they speak” in a multi-party conversation

  • The main difference being that we propose replacing the universal background model (UBM)/i-vector extractor with a well-trained deep neural network (DNN)/i-vector extractor that has been trained on a phonetic basis using a much larger database

  • It is noticeable that the performance degradation between long and short sentences is significantly reduced compared to the results in Table 2, indicating that the DNN/i-vector system is less sensitive to the source sentence length than the UBM/i-vector system

Summary

Introduction

Speaker diarization is a technology used to solve the problem of “who spoke what and when did they speak” in a multi-party conversation. Previous approaches model each segment with a single GMM or an i-vector extracted from a universal background model (UBM), for example in [18]. This has been shown to represent some segments quite well, but the complexity and capability of the model are relatively low, and it is not always able to represent all of the underlying speech. We model the variance of all outputs in a similar way to a total variability (TV) system [6] and subsequently combine the DNN and TV information into a new representation that we denote DNN/i-vector. The performance of this proposed approach is evaluated with various system-level parameters against current state-of-the-art UBM/i-vector methods.
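In a total variability system, the i-vector is the MAP point estimate of the latent factor given the accumulated statistics and the trained T-matrix. The sketch below shows the standard closed-form estimate, assuming diagonal UBM covariances; the function name `extract_ivector` and all array shapes are our illustrative choices, not the paper's implementation:

```python
import numpy as np

def extract_ivector(N, F, T, m, Sigma):
    """Point estimate of the i-vector in a total variability model.

    N:     (C,)   zeroth-order Baum-Welch statistics
    F:     (C, D) first-order Baum-Welch statistics
    T:     (C*D, R) total variability matrix
    m:     (C, D) UBM component means
    Sigma: (C, D) diagonal UBM covariances
    """
    C, D = F.shape
    # Centre the first-order stats around the UBM means: F~_c = F_c - N_c * m_c
    Fc = (F - N[:, None] * m).reshape(C * D)
    # Expand N and Sigma to supervector dimension
    Nv = np.repeat(N, D)
    Sv = Sigma.reshape(C * D)
    TS = T / Sv[:, None]                               # Sigma^{-1} T (diagonal case)
    # Posterior precision: L = I + T' Sigma^{-1} N T
    L = np.eye(T.shape[1]) + TS.T @ (Nv[:, None] * T)
    # i-vector: w = L^{-1} T' Sigma^{-1} F~
    return np.linalg.solve(L, TS.T @ Fc)
```

The same extractor serves both the baseline UBM/i-vector and the proposed DNN/i-vector configuration; only the statistics fed into it differ.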

Diarization Overview
Selection of Input Features
Experiments and Results
Summary
Conclusion