Unsupervised adaptation of PLDA models for broadcast diarization

Ignacio Viñals, Alfonso Ortega, Jesús Villalba, Eduardo Lleida, Antonio Miguel

doi:10.1186/s13636-019-0167-7

Abstract

We present a novel model adaptation approach to deal with data variability for speaker diarization in a broadcast environment. Expensive human-annotated data can be used to mitigate the domain mismatch by means of supervised model adaptation approaches. By contrast, we propose an unsupervised adaptation method that needs no in-domain labeled data, only the recording being diarized. We rely on an inner adaptation block that combines Agglomerative Hierarchical Clustering (AHC) and Mean-Shift (MS) clustering techniques with a Fully Bayesian Probabilistic Linear Discriminant Analysis (PLDA) to produce pseudo-speaker labels suitable for model adaptation. We propose multiple adaptation approaches based on this basic block, including unsupervised and semi-supervised ones. Our proposed solutions, evaluated on the Multi-Genre Broadcast 2015 (MGB) dataset, show a significant 16% relative improvement with respect to the baseline and also outperform a supervised adaptation proposal with low resources (9% relative improvement). Furthermore, our proposed unsupervised adaptation is fully compatible with a supervised one. The joint use of both adaptation techniques (supervised and unsupervised) shows a 13% relative improvement with respect to the supervised adaptation alone.

Highlights

Speaker diarization is the task of annotating an input audio document with the speaker talking at each time
The previous results have shown the influence of domain mismatch when Probabilistic Linear Discriminant Analysis (PLDA) models are considered
We propose exploring the four possible pseudo-speaker label initializations described in [11]: two clustering modalities, Agglomerative Hierarchical Clustering (AHC) and Mean-Shift (MS), each working with one of two similarity metrics, cosine similarity (COS) and PLDA log-likelihood ratio (PLDA)
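
To make one of these four initializations concrete, the following is a minimal sketch of AHC driven by cosine similarity, producing pseudo-speaker labels for a set of segment embeddings. It is an illustration only, not the authors' implementation: the function names (`cosine_similarity`, `ahc_cosine`), the use of average linkage, the stopping threshold, and the toy two-dimensional embeddings are all assumptions made for this example. The PLDA log-likelihood ratio scoring and the Mean-Shift variant are not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ahc_cosine(embeddings, threshold):
    """Bottom-up clustering with average linkage (an assumption of this sketch).

    Repeatedly merges the two most similar clusters and stops when the best
    pair falls below `threshold`. Returns one integer label per embedding.
    """
    # Start with every segment embedding in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise cosine similarity between two clusters.
        sims = [
            cosine_similarity(embeddings[i], embeddings[j])
            for i in c1
            for j in c2
        ]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                sim = linkage(clusters[p], clusters[q])
                if sim > best_sim:
                    best_sim, best_pair = sim, (p, q)
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        p, q = best_pair
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    # Convert the cluster membership lists into per-segment labels.
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical toy embeddings: three segments near one direction, two near another.
toy_embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
    [1.0, 0.0],
]
pseudo_labels = ahc_cosine(toy_embeddings, threshold=0.8)
print(pseudo_labels)
```

In this sketch the threshold plays the role of the stopping criterion that decides how many pseudo-speakers are kept; swapping `cosine_similarity` for a different pairwise score would change the metric without changing the clustering loop.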

Summary

Introduction

Speaker diarization is the task of annotating an input audio document with the speaker talking at each time. Much diarization research has been motivated by the increasing amount of available data gathered in the wild. This type of data, too abundant to be manually tagged, becomes truly valuable if trustworthy speaker labels can be inferred. Diarization is a well-defined problem with multiple available resources, but it is still far from a general solution. Some diarization overviews, such as [1, 2], provide a broad view of the state of the art, in which the most popular approach is the bottom-up clustering strategy. This strategy consists of two steps: the segmentation of some input audio into fragments with

Objectives

Methods

Results

Conclusion