Abstract

Speaker diarization (determining who spoke when, using relative identity labels) and speaker recognition (determining absolute speaker identities, without timing information) are distinct but related tasks that many scenarios require to be performed simultaneously. Traditional methods, however, address them independently. In this paper, we propose a method to jointly diarize and recognize speakers from a collection of conversations. The method exploits the sparsity and temporal smoothness of speaker identities within a conversation, together with large-scale timbre modeling across recordings and speakers. Specifically, we employ one convolutional neural network (CNN) to perform segment-level speaker classification and another CNN to estimate the probability of a speaker change within a conversation. We then concatenate the outputs of both CNNs and feed them into a recurrent neural network (RNN) for joint speaker diarization and recognition. Experiments on several datasets demonstrate the promising performance of the proposed approach.
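To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-CNN-plus-RNN architecture outlined above. All layer sizes, the input feature dimension (n_mels), the number of enrolled speakers (n_speakers), and the RNN width (hidden) are illustrative assumptions; the abstract does not specify the paper's actual configuration.

```python
import torch
import torch.nn as nn

class JointDiarizationRecognizer(nn.Module):
    """Sketch of the pipeline: one CNN classifies speakers per segment,
    a second CNN estimates speaker-change probabilities, and an RNN
    fuses both streams for joint diarization and recognition.
    All hyperparameters here are placeholders, not the paper's values."""

    def __init__(self, n_mels=40, n_speakers=100, hidden=128):
        super().__init__()
        # CNN 1: frame/segment-level speaker classification scores.
        self.speaker_cnn = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, n_speakers, kernel_size=5, padding=2),
        )
        # CNN 2: per-frame probability of a speaker change.
        self.change_cnn = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=5, padding=2),
            nn.Sigmoid(),
        )
        # RNN consumes the concatenated CNN outputs.
        self.rnn = nn.GRU(n_speakers + 1, hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_speakers)

    def forward(self, feats):
        # feats: (batch, n_mels, frames), e.g. log-mel features.
        spk = self.speaker_cnn(feats)          # (batch, n_speakers, frames)
        chg = self.change_cnn(feats)           # (batch, 1, frames)
        fused = torch.cat([spk, chg], dim=1)   # (batch, n_speakers+1, frames)
        out, _ = self.rnn(fused.transpose(1, 2))
        return self.head(out)                  # (batch, frames, n_speakers)

# Usage: per-frame speaker scores over a batch of feature maps.
model = JointDiarizationRecognizer()
logits = model(torch.randn(2, 40, 500))        # shape (2, 500, 100)
```

Decoding those per-frame scores into who-spoke-when segments (e.g. by smoothing across detected change points) is left out, since the abstract does not detail that step.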
