Abstract

Restricted Boltzmann Machines (RBMs) have shown success in both the front-end and back-end of speaker verification systems. In this paper, we propose applying RBMs to the front-end for the tasks of speaker clustering and speaker tracking in TV broadcast shows. RBMs are trained to transform utterances into a vector-based representation. Because data for a test speaker is scarce, we propose adapting a global RBM model to each speaker. First, the global model, referred to as the universal RBM (URBM), is trained with all the available background data. Then an adapted RBM model is trained with the data of each test speaker. The visible-to-hidden weight matrices of the adapted models are concatenated with the bias vectors and whitened to generate the vector representation of speakers. These vectors, referred to as RBM vectors, were shown to preserve speaker-specific information and are used in the tasks of speaker clustering and speaker tracking. The evaluation was performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed speaker clustering system gained up to 12% relative improvement, in terms of Equal Impurity (EI), over the baseline system. In the task of speaker tracking, our system achieved a relative improvement of 11% and 7% over the baseline system using cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring, respectively.
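The construction described above (flatten the adapted RBM's visible-to-hidden weights together with its bias vectors, then whiten against background statistics) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and variable names are ours, and PCA whitening is one plausible choice of whitening transform.

```python
import numpy as np

def whitening_transform(X):
    """PCA whitening statistics from background supervectors X (n_samples, dim).

    Returns the background mean and a matrix that decorrelates and
    unit-scales centered supervectors. (Illustrative helper; the paper
    does not fix a specific whitening recipe.)
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)
    W = np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T
    return mean, W

def rbm_vector(W, b_v, b_h, mean, whiten):
    """Build an RBM vector from a speaker-adapted RBM.

    W:        visible-to-hidden weight matrix of the adapted model
    b_v, b_h: visible and hidden bias vectors
    mean, whiten: whitening statistics estimated on background RBM vectors
    """
    # Concatenate weights and biases into one supervector, then whiten.
    supervector = np.concatenate([W.ravel(), b_v.ravel(), b_h.ravel()])
    return whiten @ (supervector - mean)
```

The whitened vectors can then be compared with cosine or PLDA scoring, as in the experiments above.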

Highlights

  • Deep learning has been successfully applied to various tasks of image and speech technologies in recent decades

  • We have proposed the use of Restricted Boltzmann Machine (RBM) vectors for the tasks of speaker tracking and speaker clustering in TV broadcast shows

Introduction

Deep learning has been successfully applied to various tasks in image and speech technologies in recent decades. To extract the desired RBM vector, the first step is to train a global, or universal, model with a large amount of available background speakers' utterances. This global model, referred to as the URBM, is supposed to convey speaker-independent information: trained on samples generated from the feature vectors of the background utterances, it is expected to learn both speaker and session variabilities from the large background data [22].
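The URBM training described above is standard RBM learning on background feature vectors. As a hedged sketch, one contrastive-divergence (CD-1) update for a Bernoulli RBM could look like the following; the paper's actual model for real-valued acoustic features (e.g. a Gaussian-Bernoulli variant) and its hyper-parameters are not reproduced here, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, lr=0.01, rng=None):
    """One CD-1 update on a minibatch v0 of background feature vectors.

    Sketch only: a simplified Bernoulli RBM, not the paper's exact
    training recipe.
    """
    rng = rng or np.random.default_rng()
    # Positive phase: hidden activations given the data.
    h_prob = sigmoid(v0 @ W + b_h)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    # One Gibbs step: reconstruct visibles, then hidden again.
    v1 = sigmoid(h_samp @ W.T + b_v)
    h1 = sigmoid(v1 @ W + b_h)
    # Gradient approximation: positive minus negative statistics.
    n = len(v0)
    W += lr * (v0.T @ h_prob - v1.T @ h1) / n
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (h_prob - h1).mean(axis=0)
    return W, b_v, b_h
```

After the URBM converges on background data, the same update rule (for a few epochs and with the URBM as initialization) yields the speaker-adapted models whose parameters are turned into RBM vectors.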
