Abstract

Speaker diarization aims to determine "who spoke when?" in multispeaker recordings. In this paper, we propose to learn a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. These embeddings are learned by a deep autoencoder trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames. The learned embeddings are then used in Gaussian mixture model based hierarchical clustering for diarization. The results show that these unsupervised embeddings outperform MFCCs in reducing the diarization error rate. Experiments conducted on a popular subset of the AMI meeting corpus, consisting of 5.4 h of recordings, show that the new embeddings decrease the average diarization error rate by 2.96%; for individual recordings, a maximum improvement of 8.05% is achieved.
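The pipeline described above can be sketched in a few steps: extract frame-level MFCCs, train an autoencoder and take its bottleneck activations as embeddings, then cluster the embeddings into speakers. The sketch below is only an illustration of that idea, not the authors' implementation: the library choices (librosa, PyTorch, SciPy), the file name, the network sizes, the number of epochs, and the assumed speaker count are all assumptions, and plain Ward-linkage agglomerative clustering is used in place of the paper's GMM-based hierarchical clustering to keep the example short.

```python
# Hypothetical sketch: MFCCs -> autoencoder embeddings -> hierarchical clustering.
import numpy as np
import librosa
import torch
import torch.nn as nn
from scipy.cluster.hierarchy import linkage, fcluster

# 1. Frame-level MFCC features from the recording (file name is a placeholder).
signal, sr = librosa.load("meeting.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=19).T   # shape: (frames, 19)

# 2. A small fully connected autoencoder; the bottleneck activations
#    serve as the unsupervised feature embeddings.
class AutoEncoder(nn.Module):
    def __init__(self, dim_in=19, dim_emb=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 64), nn.ReLU(),
                                     nn.Linear(64, dim_emb))
        self.decoder = nn.Sequential(nn.Linear(dim_emb, 64), nn.ReLU(),
                                     nn.Linear(64, dim_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

x = torch.tensor(mfcc, dtype=torch.float32)
model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):                       # epoch count is an assumption
    recon, _ = model(x)
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    _, embeddings = model(x)              # (frames, 10) learned embeddings

# 3. Agglomerative (hierarchical) clustering of the embeddings into speakers.
#    Ward linkage stands in here for the paper's GMM-based clustering.
Z = linkage(embeddings.numpy(), method="ward")
labels = fcluster(Z, t=4, criterion="maxclust")  # assume 4 speakers for illustration
print(labels[:20])
```

In a full diarization system the frame-level cluster labels would additionally be smoothed into contiguous speaker segments and scored against reference annotations to compute the diarization error rate.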

Highlights

  • Speaker diarization [1] is the process of partitioning an audio recording into speaker homogeneous regions

  • Building on the success of deep learning architectures, we propose autoencoder-based feature embedding learning followed by hierarchical clustering for speaker diarization

  • In this paper we propose a method for unsupervised feature embedding extraction for speaker diarization


Summary

Introduction

Speaker diarization [1] is the process of partitioning an audio recording into speaker-homogeneous regions. It answers the question of "who spoke when?" in a multispeaker environment. It is usually an unsupervised problem in which the number of speakers and the speaker-turn regions are unknown. The process automatically determines the speaker-specific segments and groups similar ones to form a speaker-specific diary. Its applications lie in multimedia information retrieval, speaker recognition, and audio processing. Use cases of diarization include the analysis of speakers and their speech in meeting recordings, TV/talk shows, movies, phone conversations, conferences, or any other multispeaker recordings.

