Abstract

End-to-end neural diarization (EEND), which can directly output speaker diarization results and handle overlapping speech, has attracted increasing attention due to its promising performance. Although existing EEND-based methods often outperform clustering-based methods, they do not generalize well to unseen test sets because fixed attractors are typically used to estimate the speech activities of each speaker. An iterative adaptive attractor estimation (IAAE) network was proposed to refine the diarization results, in which self-attentive EEND (SA-EEND) was used to initialize the diarization results and frame-wise embeddings. The proposed IAAE network consists of two main parts: an attention-based pooling that produces a rough estimate of the attractors from the diarization results of the previous iteration, and transformer decoder blocks that then compute adaptive attractors. A unified training framework was further proposed to improve diarization performance by making the embeddings more discriminative with respect to the well-separated attractors. We evaluated the proposed method on both simulated mixtures and the real CALLHOME dataset using the diarization error rate (DER). At the second iteration step, our proposed method provides relative DER reductions of up to 44.8% on simulated 2-speaker mixtures and 23.6% on the CALLHOME dataset over the baseline SA-EEND. We also demonstrated that with an increasing number of refinement steps, the DER on the CALLHOME dataset can be further reduced to 7.36%, achieving state-of-the-art diarization results compared with other methods.
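To make the iterative refinement concrete, below is a minimal PyTorch sketch of one IAAE step as described in the abstract: attention-based pooling of frame-wise embeddings weighted by the previous iteration's posteriors to obtain rough attractors, followed by transformer decoder blocks that adapt them. The layer choices, hyperparameters, and the `sa_eend` initializer named in the usage comment are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class IAAESketch(nn.Module):
    """Illustrative sketch of one iterative adaptive attractor estimation step.

    Shapes, layer sizes, and pooling details are assumptions for illustration
    only; they do not reproduce the paper's exact architecture.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Transformer decoder blocks refine the pooled (rough) attractors
        # by attending over the frame-wise embeddings.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, emb: torch.Tensor, prev_post: torch.Tensor) -> torch.Tensor:
        """emb: (B, T, D) frame-wise embeddings from the SA-EEND encoder.
        prev_post: (B, T, S) speech-activity posteriors from the previous iteration.
        Returns refined posteriors of shape (B, T, S).
        """
        # 1) Attention-based pooling: weight each frame by the previous
        #    posteriors to get a rough per-speaker attractor estimate.
        weights = prev_post / (prev_post.sum(dim=1, keepdim=True) + 1e-8)   # (B, T, S)
        rough_attractors = torch.einsum("bts,btd->bsd", weights, emb)       # (B, S, D)

        # 2) Adaptive attractors: the decoder uses the rough attractors as
        #    queries and the frame embeddings as memory.
        attractors = self.decoder(rough_attractors, emb)                     # (B, S, D)

        # 3) New posteriors: similarity between embeddings and attractors.
        logits = torch.einsum("btd,bsd->bts", emb, attractors)
        return torch.sigmoid(logits)


# Hypothetical usage: start from SA-EEND outputs and apply a few refinement steps.
# emb, post = sa_eend(features)          # initializer assumed, not defined here
# iaae = IAAESketch()
# for _ in range(2):                     # e.g., two iteration steps as in the abstract
#     post = iaae(emb, post)
```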
