Abstract

Speaker clustering groups speech segments uttered by the same speaker into a single cluster, and is an effective tool for managing large collections of audio documents. In this paper, we present a method that jointly optimizes the two main steps of speaker clustering, namely feature learning and cluster estimation. In our method, the deep representation feature is learned by a deep convolutional autoencoder network (DCAN), while cluster estimation is performed by a softmax layer combined with the DCAN. We devise an integrated loss function that simultaneously minimizes the reconstruction loss (for deep representation learning) and the clustering loss (for cluster estimation). Several state-of-the-art audio features and clustering methods are evaluated on experimental datasets drawn from two publicly available speech corpora (AISHELL-2 and VoxCeleb1). The results show that the proposed method outperforms other speaker clustering methods with respect to normalized mutual information (NMI) and clustering accuracy (CA). Additionally, the proposed deep representation feature outperforms features widely used in previous works, in terms of both NMI and CA.
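The following is a minimal sketch of the joint objective described above, assuming a PyTorch implementation: a convolutional autoencoder whose latent code feeds a softmax clustering head, trained by minimizing a reconstruction term plus a clustering term. The layer sizes, the weighting factor `alpha`, and the particular clustering term shown here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCANWithClusterHead(nn.Module):
    """Convolutional autoencoder with a softmax cluster-assignment head (sketch)."""
    def __init__(self, n_clusters: int = 10):
        super().__init__()
        # Convolutional encoder: spectrogram-like input -> latent feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder mirrors the encoder to reconstruct the input
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )
        # Softmax layer on top of the latent code produces soft cluster assignments
        self.cluster_head = nn.Linear(32, n_clusters)

    def forward(self, x):
        z = self.encoder(x)                       # deep representation feature
        recon = self.decoder(z)                   # reconstruction for the AE loss
        pooled = z.mean(dim=(2, 3))               # pool latent map to a vector
        assign = F.softmax(self.cluster_head(pooled), dim=1)
        return recon, assign

def integrated_loss(x, recon, assign, alpha=0.1):
    """Reconstruction loss plus a clustering loss, minimized jointly (illustrative)."""
    rec_loss = F.mse_loss(recon, x)
    # Illustrative clustering term: encourage confident (low-entropy) assignments
    cluster_loss = -(assign * torch.log(assign + 1e-8)).sum(dim=1).mean()
    return rec_loss + alpha * cluster_loss

# Usage: one optimization step on a batch of (batch, 1, 64, 64) segment features
model = DCANWithClusterHead(n_clusters=8)
x = torch.randn(4, 1, 64, 64)
recon, assign = model(x)
loss = integrated_loss(x, recon, assign)
loss.backward()
```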
