Speaker diarization with variants of self-attention and joint speaker embedding extractor

Pengbin Fu, Yuchen Ma, Huirong Yang

doi:10.3233/jifs-230249

Abstract

Speaker diarization is the task of automatically determining who spoke when in an audio recording, without any prior information about the speakers. End-to-End Neural Speaker Diarization (EEND) with a self-attention mechanism has addressed the problem of overlapping speech. Transformer models equipped with self-attention are effective at collecting global information and have produced strong results in a variety of tasks. However, individual speaker characteristics are reflected mainly in local contextual information, which conventional self-attention does not adequately emphasize. In this study, we propose a hierarchical encoder model that improves the encoders' acquisition of speaker information in two ways: (1) constraining the receptive field of the self-attention mechanism with left-right windows or Gaussian weights to highlight contextual information; (2) using a pre-trained speaker embedding extractor based on a time-delay neural network to compensate for the encoders' limited ability to extract speaker features. We evaluate the proposed methods on a simulated two-speaker dataset and on a real conversation dataset. The best-performing of the proposed models achieves a diarization error rate of 7.74% on the simulated dataset and 21.92% on MagicData-RAMC after adaptation. These results support the effectiveness of the proposed methods.
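The abstract does not give the exact formulation of the two attention variants, so the following is only a minimal sketch of one common way to realize them: an additive bias applied to the attention scores before the softmax. The function names (`window_bias`, `gaussian_bias`, `local_attention`) and the parameters `left`, `right`, and `sigma` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math

def softmax(row):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def window_bias(n, left, right):
    # Assumed form: frame i may attend only to frames i-left .. i+right.
    # Positions outside the window receive -inf, i.e. zero weight.
    return [[0.0 if -left <= j - i <= right else -math.inf
             for j in range(n)] for i in range(n)]

def gaussian_bias(n, sigma):
    # Assumed form: a soft penalty that grows with the squared distance
    # between frames, so nearby frames are favored but none are excluded.
    return [[-((j - i) ** 2) / (2.0 * sigma ** 2)
             for j in range(n)] for i in range(n)]

def local_attention(q, k, v, bias):
    # Scaled dot-product attention with an additive locality bias.
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + bias[i][j]
                  for j, kj in enumerate(k)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v)))
                        for c in range(len(v[0]))])
    return outputs, weights

# Toy usage: five 2-dimensional frames attending to themselves.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
_, w_window = local_attention(frames, frames, frames, window_bias(5, 1, 1))
_, w_gauss = local_attention(frames, frames, frames, gaussian_bias(5, 1.0))
```

With the window bias, each frame's attention weights are exactly zero outside its left-right window; with the Gaussian bias, distant frames are down-weighted smoothly rather than cut off. Which of the two works better, and with what window size or variance, is an empirical question that the abstract does not answer.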

Journal: Journal of Intelligent & Fuzzy Systems
Publication Date: Nov 4, 2023
Citations: 1
