Abstract

A novel method for active speaker detection and localization in audio-visual recordings is proposed. The method relies on a specifically tailored matrix decomposition that exploits the intrinsic low-dimensional structure of audio-visual data, namely, the low rank of the background visual and audio information and the sparsity of the correlated foreground components. Concretely, the data matrix of each modality is modeled as a superposition of two terms: 1) a low-rank matrix capturing the background information and 2) a kernelized sparse matrix capturing the non-linearly correlated components of the audio and visual modalities and, hence, revealing the active speaker. To this end, we formulate an appropriate optimization problem that involves the minimization of the nuclear norm and the matrix $\ell_1$-norm, and develop an efficient solver. Experimental results on active speaker detection and localization demonstrate the superior performance of the proposed method over other state-of-the-art approaches.
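
To make the low-rank plus sparse model concrete, the following is a minimal sketch of a generic decomposition of a single data matrix into a low-rank term and a sparse term, using the standard proximal operators of the nuclear norm and the matrix $\ell_1$-norm. It is not the solver developed in the paper: it handles one modality only and omits the kernelized coupling between audio and visual data. The function names, the default weights `lam` and `mu`, and the fixed-penalty ADMM iteration are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1: shrink every entry toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Proximal operator of tau * ||X||_*: shrink the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def low_rank_plus_sparse(M, lam=None, mu=None, n_iter=200, tol=1e-7):
    """Illustrative ADMM for min ||L||_* + lam * ||S||_1  s.t.  M = L + S.

    Hypothetical helper: single modality, no kernel term.
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    # Assumed default weights, commonly used for this generic problem.
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12) if mu is None else mu
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint M = L + S
    for _ in range(n_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)  # background
        S = soft_threshold(M - L + Y / mu, lam / mu)            # foreground
        R = M - L - S                                           # residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * max(np.linalg.norm(M), 1e-12):
            break
    return L, S
```

In this simplified setting, the columns of `M` would hold vectorized frames (or audio feature vectors), `L` would absorb the slowly varying background, and the nonzero entries of `S` would mark the foreground activity.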
