Abstract

Multi-speaker multimedia speaker recognition (MMSR) has recently garnered significant attention. Whereas prior research focused primarily on back-end score-level fusion of audio and visual information, this study investigates techniques for integrating audio and visual cues in the front-end representations of speakers' voices and faces. The first method uses visual information to estimate the number of speakers, addressing the difficulty of speaker counting in multi-speaker conversations, especially in noisy environments; agglomerative hierarchical clustering is then employed for speaker diarization, which proves beneficial for MMSR. We term this approach video aiding audio fusion (VAAF). The second method introduces a ratio factor to form a multimedia vector (M-vector) that concatenates face embeddings with the x-vector, encapsulating both audio and visual cues; the resulting M-vector is then used for MMSR. We name this method video interacting audio fusion (VIAF). Experimental results on the NIST SRE 2019 audio-visual corpus show that the VAAF-based MMSR achieves 6.94% and 8.31% relative reductions in minDCF and actDCF, respectively, over zero-effort systems. The VIAF-based MMSR achieves 12.08% and 12.99% relative reductions in minDCF and actDCF, respectively, compared with systems that use face embeddings alone. Notably, combining both methods further reduces minDCF and actDCF to 0.098 and 0.102, respectively.
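To make the two fusion strategies concrete, the Python sketch below is illustrative only, not the authors' implementation: it shows how a visually estimated speaker count could drive agglomerative hierarchical clustering (VAAF) and how a ratio factor could weight the concatenation of a face embedding with an x-vector into an M-vector (VIAF). The Ward linkage, the length normalization, the 512-dimensional embeddings, and all function names are assumptions.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def diarize_with_visual_count(segment_xvectors: np.ndarray,
                                  num_speakers: int) -> np.ndarray:
        """VAAF sketch: cluster per-segment x-vectors with agglomerative
        hierarchical clustering, using a speaker count estimated from the
        video (e.g., the number of tracked faces) instead of an audio-only
        stopping threshold. Linkage method is an assumption.
        """
        # Ward linkage over the (num_segments, dim) x-vector matrix.
        Z = linkage(segment_xvectors, method="ward")
        # Cut the dendrogram into exactly `num_speakers` clusters.
        return fcluster(Z, t=num_speakers, criterion="maxclust")

    def build_m_vector(x_vector: np.ndarray,
                       face_embedding: np.ndarray,
                       ratio: float = 0.5) -> np.ndarray:
        """VIAF sketch: concatenate a face embedding with an x-vector,
        balanced by a ratio factor, to form a multimedia vector (M-vector).
        The exact weighting and normalization in the paper may differ.
        """
        x = x_vector / np.linalg.norm(x_vector)              # normalize audio embedding
        f = face_embedding / np.linalg.norm(face_embedding)  # normalize visual embedding
        return np.concatenate([ratio * f, (1.0 - ratio) * x])

    # Example usage with made-up dimensions (512-d embeddings, 40 segments):
    rng = np.random.default_rng(0)
    segments = rng.normal(size=(40, 512))
    labels = diarize_with_visual_count(segments, num_speakers=3)
    m_vec = build_m_vector(rng.normal(size=512), rng.normal(size=512), ratio=0.4)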
