Abstract

Deep speaker embedding extraction models have recently served as the cornerstone of modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (i.e., speaker features) do not effectively leverage their intrinsic relationships and, moreover, are not tailored specifically to the clustering task. In this paper, inspired by deep embedded clustering (DEC), we propose a speaker diarization method using graph attention-based deep embedded clustering (GADEC) to address these issues. First, given the temporal nature of speech, when an audio signal is divided into small segments, the speech in the current segment and in its neighboring segments is likely to belong to the same speaker. This suggests that embeddings extracted from neighboring segments can help produce a more informative speaker representation for the current segment. To better describe the complex relationships between segments and exploit the local structural information among their embeddings, we construct a graph over the pre-extracted speaker embeddings of a continuous audio signal. On this basis, we introduce a graph attentional encoder (GAE) module that integrates information from neighboring nodes (i.e., neighboring segments) in the graph and learns latent speaker embeddings. Moreover, we jointly optimize both the latent speaker embeddings and the clustering results within a unified framework, yielding more discriminative speaker embeddings for the clustering task. Experimental results demonstrate that our proposed GADEC-based speaker diarization system significantly outperforms the baseline systems and several other recent speaker diarization systems in terms of diarization error rate (DER) on the NIST SRE 2000 CALLHOME, AMI, and VoxConverse datasets.
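To make the two core ideas of the abstract concrete, the sketch below illustrates (1) a single-head graph attention layer that refines each segment's speaker embedding by attending over its temporal neighbors, and (2) a DEC-style soft cluster assignment that can be optimized jointly with the embeddings. This is a minimal illustration following the standard GAT and DEC formulations, not the authors' implementation; all layer sizes, the neighborhood width, and variable names are assumptions for the example.

```python
# Illustrative sketch only: graph attention over temporally adjacent segments
# plus a DEC-style soft assignment, assuming pre-extracted speaker embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head graph attention: each node attends over its graph neighbors."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) segment embeddings; adj: (N, N) binary adjacency matrix.
        h = self.W(x)                                   # (N, out_dim)
        n = h.size(0)
        # Pairwise attention logits e_ij = a([h_i || h_j]).
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1), 0.2)
        e = e.masked_fill(adj == 0, float("-inf"))      # attend only to neighbors
        alpha = torch.softmax(e, dim=-1)                # per-node attention weights
        return F.elu(alpha @ h)                         # neighbor-aggregated embeddings


def dec_soft_assignment(z: torch.Tensor, centroids: torch.Tensor, nu: float = 1.0):
    """Student's-t soft assignment of latent embeddings z to cluster centroids (DEC)."""
    dist_sq = torch.cdist(z, centroids) ** 2            # (N, K) squared distances
    q = (1.0 + dist_sq / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)               # rows sum to 1


def dec_target_distribution(q: torch.Tensor) -> torch.Tensor:
    """Sharpened target distribution p used in the DEC KL-divergence objective."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)


if __name__ == "__main__":
    # Toy example: 10 segments with 256-d embeddings, each connected to its
    # two temporal neighbors on either side; 2 hypothetical speakers.
    torch.manual_seed(0)
    x = torch.randn(10, 256)
    idx = torch.arange(10)
    adj = ((idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= 2).float()

    gae = GraphAttentionLayer(256, 64)
    z = gae(x, adj)                                      # latent speaker embeddings
    centroids = nn.Parameter(torch.randn(2, 64))         # learnable cluster centers
    q = dec_soft_assignment(z, centroids)
    p = dec_target_distribution(q).detach()
    kl_loss = F.kl_div(q.log(), p, reduction="batchmean")  # clustering objective
    print(z.shape, q.shape, kl_loss.item())
```

In a full system, the KL-divergence clustering loss would be backpropagated through both the centroids and the attention layer, which is what allows the latent embeddings and the clustering results to be optimized jointly as described above.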
