Speaker diarization is the task of automatically distinguishing speakers within an audio recording, without any prior information about the speakers. The introduction of the self-attention mechanism in End-to-End Neural Speaker Diarization (EEND) handles overlapping speech directly. The Transformer, built on self-attention, excels at capturing global information and has achieved strong results across many tasks. However, individual speaker characteristics reside largely in local contextual information, which conventional self-attention does not model adequately. In this study, we propose a hierarchical-encoder model that strengthens the encoders' capture of speaker information in two ways: (1) constraining the receptive field of self-attention with left-right windows or Gaussian weights to emphasize local context; (2) employing a pre-trained time-delay neural network (TDNN) based speaker-embedding extractor to compensate for the model's limited speaker-feature extraction ability. We evaluate the proposed methods on a simulated two-speaker dataset and a real conversational dataset. The best-performing variant achieves a diarization error rate of 7.74% on the simulated dataset and 21.92% on MagicData-RAMC after adaptation, demonstrating the effectiveness of the proposed methods.
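The two locality constraints named in (1) can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' implementation: the function name and the `window` and `gaussian_sigma` parameters are hypothetical, and the exact form of the paper's windowing and Gaussian weighting may differ.

```python
import math
import torch

def local_attention_scores(q, k, window=None, gaussian_sigma=None):
    """Illustrative sketch (not the paper's code): scaled dot-product
    attention with an optional locality constraint.

    q, k: (batch, time, dim) query/key tensors.
    window: if set, key frames farther than `window` positions from the
            query frame are masked out (left-right window variant).
    gaussian_sigma: if set, scores receive a Gaussian distance penalty
            so nearby frames dominate (Gaussian-weight variant).
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # (B, T, T)

    t = q.size(1)
    pos = torch.arange(t, device=q.device)
    dist = (pos[None, :] - pos[:, None]).abs().float()  # |i - j|, shape (T, T)

    if window is not None:
        # Hard locality: attention restricted to a left-right window.
        scores = scores.masked_fill(dist > window, float("-inf"))
    if gaussian_sigma is not None:
        # Soft locality: quadratic penalty grows with frame distance.
        scores = scores - dist.pow(2) / (2 * gaussian_sigma ** 2)

    return torch.softmax(scores, dim=-1)
```

Either constraint biases each frame's attention toward its temporal neighborhood, which is where the abstract argues speaker-discriminative context resides.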