The use of long-term features for GMM- and i-vector-based speaker diarization systems

Abraham Woubie Zewoudie, Jordi Luque, Javier Hernando

doi:10.1186/s13636-018-0140-x

Abstract

Several factors contribute to the performance of speaker diarization systems. The appropriate selection of speech features is one of the key factors; others include the techniques employed for segmentation and clustering. While static mel frequency cepstral coefficients are the most widely used features in speech-related tasks, including speaker diarization, several studies have shown the benefits of augmenting the static coefficients with additional speech features.

In this work, we propose and assess the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with state-of-the-art short-term cepstral and long-term prosodic features. The use of delta dynamic features is also explored separately for the segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are modeled independently and fused at the score likelihood level.

Various feature combinations have been applied to both Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The best result, in terms of diarization error rate, is obtained using i-vector-based cosine-distance clustering together with a signal parameterization that combines static cepstral coefficients, delta, voice-quality, and prosodic features. This result corresponds to a relative diarization error rate improvement of about 24% over the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
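The two fusion levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the single diagonal-covariance Gaussian standing in for a full GMM, the assumption that both feature streams are already aligned frame by frame, and the fusion weight of 0.9 are all illustrative assumptions.

```python
import math


def stack_features(stream_a, stream_b):
    """Feature-level fusion: concatenate two per-frame feature streams.

    Each stream is a list of frames, and each frame is a list of floats.
    Assumes (hypothetically) that both streams are already frame-aligned.
    """
    if len(stream_a) != len(stream_b):
        raise ValueError("streams must contain the same number of frames")
    return [frame_a + frame_b for frame_a, frame_b in zip(stream_a, stream_b)]


def diag_gauss_loglik(frame, mean, var):
    """Log-likelihood of one frame under a diagonal-covariance Gaussian.

    A single Gaussian is used here as a simple stand-in for a GMM.
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def fuse_scores(loglik_short, loglik_long, weight_short=0.9):
    """Score-level fusion: weighted sum of two stream log-likelihoods.

    weight_short is a hypothetical value, not one taken from the paper.
    """
    return weight_short * loglik_short + (1.0 - weight_short) * loglik_long


if __name__ == "__main__":
    # Toy frames: two prosodic-like values and one voice-quality-like value.
    prosodic = [[120.0, 0.3], [118.0, 0.4]]
    voice_quality = [[0.02], [0.03]]
    print(stack_features(prosodic, voice_quality))

    # Each stream is scored by its own model, then the scores are fused.
    ll_short = diag_gauss_loglik([0.1, -0.2], [0.0, 0.0], [1.0, 1.0])
    ll_long = diag_gauss_loglik([120.0], [119.0], [4.0])
    print(fuse_scores(ll_short, ll_long))
```

In the feature-level case a single model sees the stacked vector, whereas in the score-level case each stream keeps its own model and only the resulting log-likelihoods are combined.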

Highlights

An audio recording normally consists of different speakers, music segments, noises, etc.
Note that the baseline system is based on Bayesian information criterion (BIC) clustering and a static Mel frequency cepstral coefficient (MFCC) feature set for both segmentation and clustering.
For the Gaussian mixture modeling (GMM)-based speaker diarization system, the table shows that fusing the concatenated or individual long-term speech features with the cepstral coefficients provides a better diarization error rate (DER) than using only the latter feature set.
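The DER comparisons mentioned above can be made concrete with a short Python sketch. It uses the commonly used definition of DER as the sum of missed speech, false alarm, and speaker error time divided by the total reference speech time, and computes the relative improvement over a baseline. The function names and all numbers in the usage lines are hypothetical and are not results from the paper.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """DER as a fraction: the three error durations over total speech time.

    All arguments are durations in the same unit (for example, seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + speaker_error) / total_speech


def relative_improvement(baseline_der, system_der):
    """Relative DER improvement of a system over a baseline, as a fraction."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - system_der) / baseline_der


if __name__ == "__main__":
    # Hypothetical durations in seconds, chosen only for illustration.
    baseline = diarization_error_rate(12.0, 8.0, 20.0, 200.0)
    fused = diarization_error_rate(12.0, 8.0, 10.0, 200.0)
    print(baseline, fused, relative_improvement(baseline, fused))
```

A relative improvement is measured against the baseline DER, so it is larger than the absolute difference between the two error rates whenever the baseline DER is below 100%.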

Summary

Introduction

An audio recording normally consists of different speakers, music segments, noises, etc. Speaker diarization first needs to classify the speech and non-speech parts of the audio signal. It then marks the speaker changes in the detected speech and clusters the speech segments that belong to the same speaker [1]. One of the factors that affect the performance of speaker diarization systems is the extraction of relevant speaker features. Mel frequency cepstral coefficients (MFCCs) are the most widely used short-term speech features for speaker diarization [2]. Despite their broad employment in speaker diarization, it is described in [3, 4] that

Results

Discussion

Conclusion