Active Speaker Detection and Localization in Videos Using Low-Rank and Kernelized Sparsity

Jie Pu, Yannis Panagakis, Maja Pantic

doi:10.1109/lsp.2020.2996412

Abstract

A novel method for active speaker detection and localization in audio-visual recordings is proposed. The method relies on a specifically tailored matrix decomposition that exploits the intrinsic low-dimensional structure of audio-visual data, namely, the low-rank structure of the background visual and audio information and the sparsity of the correlated foreground components. Concretely, the data matrix of each modality is modeled as a superposition of two terms: 1) a low-rank matrix capturing the background information and 2) a kernelized sparse matrix capturing the non-linearly correlated components across the audio and visual modalities and, hence, revealing the active speaker. To this end, we formulate an appropriate optimization problem that involves the minimization of the nuclear norm and the matrix $\ell_1$-norm, and develop an efficient solver. Experimental results on active speaker detection and localization show that the proposed method outperforms other state-of-the-art approaches.
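
To make the "low-rank plus sparse" idea concrete, the sketch below solves the generic, non-kernelized version of such a decomposition (principal component pursuit) for a single data matrix, using a standard augmented-Lagrangian iteration. It is not the method proposed in the paper: the kernelization and the coupling between the audio and visual modalities are omitted, and the function name `low_rank_plus_sparse`, the parameters `lam` and `mu`, and their default values are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(X, tau):
    """Elementwise shrinkage: the proximal operator of tau * l1-norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Singular value shrinkage: the proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def low_rank_plus_sparse(M, lam=None, mu=None, n_iter=500, tol=1e-7):
    """Split M into L (low-rank) + S (sparse) by approximately solving
        min ||L||_* + lam * ||S||_1   subject to   L + S = M.
    Hypothetical helper for illustration; defaults are common heuristics.
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))          # assumed default weight
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())    # assumed default penalty
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                         # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(n_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)   # update low-rank part
        S = soft_threshold(M - L + Y / mu, lam / mu)  # update sparse part
        R = M - L - S                                 # constraint residual
        Y = Y + mu * R                                # dual ascent step
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S
```

In this simplified setting, the columns of `M` could be vectorized video frames or audio feature vectors: `L` would then absorb the slowly varying background, while the nonzero entries of `S` mark the foreground activity.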

Full Text