Abstract

In this article, we propose a new method for joint cochannel speaker separation and recognition called adaptive-dictionary non-negative matrix deconvolution (DANMD). This method is an extension of non-negative matrix deconvolution (NMD) which models spectrogram matrix as a linear combination of dictionary elements (atoms). We propose a dictionary which is a linear combination of speaker-independent component and components representing speaker variability. The dictionary is parametric and all atoms depend on a small number of parameters. The speaker-independent component and components representing speaker variability are learned from recordings of tens or hundreds of speakers. We show that the proposed method can be applied to the single-channel speech separation task where two speakers of unknown identity are to be separated. In a scenario where the unknown speakers’ recordings are in training dataset together with recordings of many other speakers, we show that the proposed method outperforms stacked NMD (NMD with a dictionary which contains atoms of all speakers in the dataset) in terms of signal-to-distortion ratio (SDR). DANMD was also tested in a scenario where recordings of the recognized speakers were not in the training dataset. In this case it brought clearly positive signal-to-distortion ratios. The proposed model was also tested for a co-channel speaker identification task, where the parameters of the adapted model are a basis for a decision about the identity of the speakers in the mixture. In this case, the accuracy was 81.2 in comparison to 84.1 in the case of stacked NMD. While the speaker recognition accuracy is lower for the new approach, we find the primary value in the improved SDR.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call