Who spoke what? A latent variable framework for the joint decoding of multiple speakers and their keywords

Harshavardhan Sundar, Thippur V Sreenivas

doi:10.1109/spcom.2016.7746658

Abstract

In this paper, we present a latent variable (LV) framework to identify all the speakers and their keywords given a single-channel microphone recording containing a multi-speaker mixture signal. We introduce two separate LVs to denote the active speakers and the keywords uttered. The dependency of a spoken keyword on the speaker is modeled through a conditional probability mass function. The distribution of the mixture signal is expressed in terms of the LV mass functions and speaker-specific keyword models. The proposed framework admits, as speaker-specific keyword models, stochastic models that represent the probability density function of the observation vectors given that a particular speaker uttered a specific keyword. The LV mass functions are estimated in a Maximum Likelihood framework using the Expectation Maximization (EM) algorithm. The active speakers and their keywords are detected as modes of the joint distribution of the two LVs. With Student's t Mixture Models (tMMs) as speaker-specific keyword models, the proposed approach detects at least one speaker-keyword pair in a mixture signal with two speakers with an accuracy of 99%, and both speaker-keyword pairs with an accuracy of 82%.
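The estimation step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the per-frame log-likelihoods under each speaker-specific keyword model are already available as an array `log_lik` of shape (frames, speakers, keywords), that frames are treated as independent draws from the resulting mixture, and that the joint mass function P(speaker, keyword) is estimated directly rather than through its factors P(speaker) and P(keyword | speaker). The names `em_joint_pmf` and `top_pairs` are hypothetical.

```python
import numpy as np

def em_joint_pmf(log_lik, n_iter=50):
    """EM for the mixture weights P(s, k), with the component densities held fixed.

    log_lik: array of shape (T, S, K) holding log p(x_t | speaker s, keyword k),
    assumed precomputed from some speaker-specific keyword models.
    """
    T, S, K = log_lik.shape
    pmf = np.full((S, K), 1.0 / (S * K))  # uniform initialisation
    for _ in range(n_iter):
        # E-step: posterior of each (speaker, keyword) pair for every frame,
        # computed in the log domain for numerical stability.
        log_post = log_lik + np.log(pmf + 1e-300)
        log_post -= log_post.max(axis=(1, 2), keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=(1, 2), keepdims=True)
        # M-step: the ML update of the mass function is the average posterior.
        pmf = post.mean(axis=0)
    return pmf

def top_pairs(pmf, n_pairs=2):
    """Return the n_pairs (speaker, keyword) index pairs with the largest mass."""
    order = np.argsort(pmf, axis=None)[::-1][:n_pairs]
    return [tuple(int(i) for i in np.unravel_index(f, pmf.shape)) for f in order]
```

In this simplified setting, calling `top_pairs(em_joint_pmf(log_lik), 2)` picks the two largest entries of the estimated joint mass function as the detected speaker-keyword pairs for a two-speaker mixture.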

Full Text