Abstract

This paper proposes using a previously well-trained deep neural network (DNN) to enhance the i-vector representation used for speaker diarization. In effect, we replace the Gaussian mixture model typically used to train a universal background model (UBM) with a DNN that has been trained on a different large-scale dataset. To train the T-matrix, instead of a traditional unsupervised UBM derived from a single feature stream, we use a supervised UBM: the DNN computes posterior information from filterbank input features, and MFCC features are then used to train the UBM. Next, we jointly use DNN and MFCC features to calculate the zeroth- and first-order Baum-Welch statistics for training an extractor from which we obtain the i-vector. The system will be shown to achieve a significant improvement on the NIST 2008 speaker recognition evaluation telephone data task compared to state-of-the-art approaches.
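The zeroth- and first-order Baum-Welch statistics mentioned above pair the DNN's frame-level posteriors with MFCC features. A minimal NumPy sketch of that accumulation, assuming a `(frames, components)` posterior matrix; the function name and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def baum_welch_stats(posteriors, mfcc):
    """Accumulate zeroth- and first-order Baum-Welch statistics.

    posteriors: (T, C) frame-level component posteriors (here, from the DNN)
    mfcc:       (T, D) MFCC feature vectors for the same T frames
    """
    # Zeroth-order: soft frame counts per component, N_c = sum_t gamma_t(c)
    N = posteriors.sum(axis=0)        # shape (C,)
    # First-order: posterior-weighted feature sums, F_c = sum_t gamma_t(c) * x_t
    F = posteriors.T @ mfcc           # shape (C, D)
    return N, F
```

The only change from a conventional UBM/i-vector pipeline is the source of `posteriors`: they come from the DNN rather than from GMM component likelihoods.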

Highlights

  • Speaker diarization is a technology used to solve the problem of “who spoke what and when did they speak” in a multi-party conversation

  • The main difference being that we propose replacing the universal background model (UBM)/i-vector extractor with a well-trained deep neural network (DNN)/i-vector extractor that has been trained on a phonetic basis using a much larger database

  • It is noticeable that the performance degradation between long and short sentences is significantly reduced compared to the results in Table 2, indicating that the DNN/i-vector system is less sensitive to the source sentence length than the UBM/i-vector system

Summary

Introduction

Speaker diarization is a technology used to solve the problem of “who spoke what and when did they speak” in a multi-party conversation. Previous approaches model each segment with a single GMM or an i-vector extracted from a universal background model (UBM), for example in [18]. This has been shown to represent some segments quite well, but the complexity and capability of the model are relatively low, and it is not always able to represent all of the underlying speech. We model the variance of all outputs in a similar way to a total variability (TV) system [6] and subsequently combine the DNN and TV information into a new representation that we denote DNN/i-vector. The performance of this proposed approach is evaluated with various system-level parameters against current state-of-the-art UBM/i-vector methods.
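In a total variability system, the i-vector is the MAP point estimate of the latent factor given the accumulated statistics and the trained T-matrix. The sketch below shows the standard closed-form estimate, assuming diagonal UBM covariances; the function name `extract_ivector` and all array shapes are our illustrative choices, not the paper's implementation:

```python
import numpy as np

def extract_ivector(N, F, T, m, Sigma):
    """Point estimate of the i-vector in a total variability model.

    N:     (C,)   zeroth-order Baum-Welch statistics
    F:     (C, D) first-order Baum-Welch statistics
    T:     (C*D, R) total variability matrix
    m:     (C, D) UBM component means
    Sigma: (C, D) diagonal UBM covariances
    """
    C, D = F.shape
    # Centre the first-order stats around the UBM means: F~_c = F_c - N_c * m_c
    Fc = (F - N[:, None] * m).reshape(C * D)
    # Expand N and Sigma to supervector dimension
    Nv = np.repeat(N, D)
    Sv = Sigma.reshape(C * D)
    TS = T / Sv[:, None]                               # Sigma^{-1} T (diagonal case)
    # Posterior precision: L = I + T' Sigma^{-1} N T
    L = np.eye(T.shape[1]) + TS.T @ (Nv[:, None] * T)
    # i-vector: w = L^{-1} T' Sigma^{-1} F~
    return np.linalg.solve(L, TS.T @ Fc)
```

The same extractor serves both the baseline UBM/i-vector and the proposed DNN/i-vector configuration; only the statistics fed into it differ.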

Diarization Overview
Selection of Input Features
Experiments and Results
Summary
Conclusion