A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Frantisek Kynych, Petr Cerva, Jindrich Zdansky, Torbjørn Svendsen, Giampiero Salvi

doi:10.1186/s13636-024-00382-2

Abstract

This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. Unlike most existing work, it therefore considers not only the accuracy but also the computational demands of each investigated method. We first propose a new lightweight scheme for speaker diarization of streamed audio data. Our approach uses a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings efficiently by means of cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to those of the reference offline system while operating solely on a CPU with a real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward the much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned audio processing scheme by adding an audio-video module. This module uses SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs of the audio and audio-video modules using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from the Internet or TV broadcasts, in real time on a GPU and with the same latency as for audio data processing.
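To make the reported efficiency figure concrete, the sketch below computes a real-time factor under its common definition, processing time divided by the duration of the processed audio. The function names and the sample numbers are illustrative assumptions and are not taken from the paper; only the threshold of 0.1 comes from the abstract.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / audio duration (common definition).

    Both arguments are hypothetical measurements, not values from the paper.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds


def keeps_up_with_stream(rtf: float) -> bool:
    """A system can follow a live stream only if it processes audio
    faster than the audio arrives, i.e. RTF < 1."""
    return rtf < 1.0


if __name__ == "__main__":
    # Illustrative numbers: 60 s of audio processed in 6 s of compute time.
    rtf = real_time_factor(processing_seconds=6.0, audio_seconds=60.0)
    print(f"RTF = {rtf:.2f}, keeps up with stream: {keeps_up_with_stream(rtf)}")
```

Under this definition, an RTF below 0.1 means that one minute of audio needs less than six seconds of compute. RTF is distinct from latency: the roughly 5.5 s latency quoted in the abstract is the delay before an output becomes available, which this sketch does not model.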

Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: Nov 28, 2024
License type: CC BY 4.0
