Abstract
Speaker diarization aims to determine “who spoke when?” in multispeaker recordings. In this paper, we propose to learn a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. These embeddings are learned by a deep autoencoder trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames. The learned embeddings are then used in Gaussian mixture model based hierarchical clustering for diarization. The results show that these unsupervised embeddings outperform MFCCs in reducing the diarization error rate. Experiments conducted on a popular subset of the AMI meeting corpus consisting of 5.4 h of recordings show that the new embeddings decrease the average diarization error rate by 2.96%; for individual recordings, a maximum improvement of 8.05% is achieved.
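The pipeline described above can be sketched in a minimal, self-contained form. This is not the authors' implementation: it uses randomly generated stand-in features in place of real MFCCs (which would come from an audio front end), scikit-learn's MLPRegressor trained to reconstruct its input as a stand-in for the deep autoencoder, and a plain GMM in place of GMM-based hierarchical clustering. All array sizes and component counts are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for per-frame MFCC features (hypothetical data; real MFCCs
# would be extracted from the recording): 200 frames x 19 coefficients.
mfcc = rng.normal(size=(200, 19))

# Unsupervised autoencoder sketch: an MLP trained to reproduce its own
# input. The hidden layer is the bottleneck whose activations serve as
# the learned feature embeddings.
ae = MLPRegressor(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
ae.fit(mfcc, mfcc)

# Forward pass to the bottleneck (default MLPRegressor activation is ReLU):
# these hidden activations are the per-frame embeddings.
embeddings = np.maximum(0, mfcc @ ae.coefs_[0] + ae.intercepts_[0])

# Cluster the embeddings with a Gaussian mixture model; each component
# index plays the role of a speaker label per frame.
gmm = GaussianMixture(n_components=2, random_state=0).fit(embeddings)
labels = gmm.predict(embeddings)
```

In the paper's actual system, frame-level clusters would be merged hierarchically and mapped back to time segments to produce the speaker diary.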
Highlights
Speaker diarization [1] is the process of partitioning an audio recording into speaker homogeneous regions
Building upon the success of deep learning architectures, we propose feature embeddings learning based on autoencoders followed by hierarchical clustering for speaker diarization
In this paper we propose a method for unsupervised feature-embedding extraction for speaker diarization
Summary
Speaker diarization [1] is the process of partitioning an audio recording into speaker homogeneous regions. It answers the question of “who spoke when?” in a multispeaker environment. It is usually an unsupervised problem where the number of speakers and speaker-turn regions are unknown. The process automatically determines the speaker-specific segments and groups similar ones to form a speaker-specific diary. Its application lies in multimedia information retrieval, speaker recognition, and audio processing. Use cases of diarization include the analysis of speakers and their speech in meeting recordings, TV/talk shows, movies, phone conversations, conferences, or any other multispeaker recordings