Abstract

This paper describes a post-evaluation analysis of the system developed by the ViVoLAB research group for the IberSPEECH-RTVE 2020 Multimodal Diarization (MD) Challenge. This challenge focuses on the study of multimodal systems for the diarization of audiovisual files and the assignment of an identity to each segment where a person is detected. In this work, we implemented two different subsystems to address this task using the audio and the video from audiovisual files separately. To develop our subsystems, we used state-of-the-art speaker and face verification embeddings extracted from publicly available deep neural networks (DNN). Different clustering techniques were also employed in combination with the tracking and identity assignment process. Furthermore, we included a novel back-end approach in the face verification subsystem to train an enrollment model for each identity, which we have previously shown to improve the results compared to the average of the enrollment data. Using this approach, we trained a learnable vector to represent each enrollment character. The loss function employed to train this vector was an approximated version of the detection cost function (aDCF), which is inspired by the detection cost function (DCF), a metric widely used to measure performance in verification tasks. In this paper, we also focused on exploring and analyzing the effect of training this vector with several configurations of this objective loss function. This analysis allows us to assess the impact of the configuration parameters of the loss on the amount and type of errors produced by the system.
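The core idea behind the aDCF objective is to replace the non-differentiable step functions in the miss and false-alarm rates of the DCF with smooth sigmoids, so the cost can be minimized by gradient descent when training the enrollment vectors. The following is a minimal sketch of such a loss; the weights, threshold, and slope used here are illustrative placeholders, not the configurations evaluated in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adcf_loss(target_scores, nontarget_scores, threshold=0.0,
              alpha=0.75, beta=0.25, slope=10.0):
    """Approximated Detection Cost Function (aDCF).

    Replaces the step functions in the DCF's miss and false-alarm
    rates with sigmoids so the cost becomes differentiable. The
    alpha/beta weights trade off misses against false alarms; all
    parameter values here are assumptions for illustration.
    """
    # Soft miss rate: target scores falling below the threshold
    p_miss = np.mean(sigmoid(slope * (threshold - target_scores)))
    # Soft false-alarm rate: non-target scores above the threshold
    p_fa = np.mean(sigmoid(slope * (nontarget_scores - threshold)))
    return alpha * p_miss + beta * p_fa
```

Varying `alpha`, `beta`, and `threshold` shifts the operating point of the trained model, which is precisely the kind of configuration effect the paper's analysis examines.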

Highlights

  • Multimodal biometric verification identifies persons by means of more than one biometric characteristic; the use of two modalities makes the process more robust to potential problems

  • Different alternatives have been presented in the literature to design loss functions focused on the final evaluation metrics to train deep neural network (DNN) systems, such as the approximated area under the ROC curve [16,17], the partial and multiclass AUC loss [18,19,20], and the approximated detection cost function [14], which was used for this work

  • For the identity assignment process, we compared applying a cosine similarity metric directly to the embeddings extracted from the pretrained model (AverageEmbedding) to obtain the closest identity in each instance against training face enrollment models (EnrollmentModels)
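The AverageEmbedding baseline in the last highlight can be sketched as follows: average the enrollment embeddings of each identity, score a query embedding against each average with cosine similarity, and keep the best match above a decision threshold. The function names and the threshold value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_identity(query, enrollment, threshold=0.5):
    """AverageEmbedding baseline (illustrative sketch).

    enrollment maps each identity name to a list of embedding vectors.
    Each identity is represented by the mean of its enrollment
    embeddings; the query is assigned to the highest-scoring identity,
    or to None (unknown) if no score reaches the threshold.
    """
    models = {name: np.mean(np.stack(embs), axis=0)
              for name, embs in enrollment.items()}
    scores = {name: cosine(query, model) for name, model in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

The EnrollmentModels approach replaces the fixed per-identity average with a learnable vector trained under the aDCF objective, which is what the paper's analysis compares against this baseline.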


Summary

Introduction

Multimodal biometric verification identifies persons by means of more than one biometric characteristic, as the use of two modalities makes the process more robust to potential problems. Face and voice characteristics have been two of the preferred biometric data due to the ease of obtaining audiovisual resources with which to build systems that perform this process. When this identification process is applied throughout a video file, and this information is kept over time, the task is known as multimodal diarization combined with identity assignment. In recent years, this field has been widely investigated due to its great interest, motivated by the fact that human perception combines acoustic and visual information to reduce speech uncertainty.

RTVE 2020 Challenge
Face Enrollment Models
Training Process of Enrollment Models
Face Subsystem
Frame Extraction
Face Detection
Change Shot Detection
Embedding Extraction
Training Face Enrollment Models
Clustering
Tracking and Identity Assignment Scoring
Speaker Subsystem
Front-End and Speech Activity Detection
Speaker Change Point Detection
Identity Assignment Scoring
Performance Metrics
Results
Analysis of Training Enrollment Models for Face Subsystem
Summary of Face and Speaker Results
Conclusions
