Learning Visual Voice Activity Detection with an Automatically Annotated Dataset

Radu Horaud, Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo

doi:10.48448/s2tz-qz16

Publication Date: Dec 29, 2020

#Visual Voice Activity Detection #Facial Landmarks

Abstract

Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient, either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild – WildVVAD – based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
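
The abstract does not spell out how A-VAD output is combined with face tracks, so the following is only a minimal sketch of one plausible labeling rule, not the authors' pipeline. It assumes that A-VAD yields non-overlapping speech intervals in seconds and that a face tracker yields time-stamped tracks. The names `FaceTrack`, `speech_overlap`, `label_track` and both thresholds are hypothetical, introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (start_s, end_s) of a segment that an audio VAD marked as speech.
# Assumption: intervals do not overlap each other.
Interval = Tuple[float, float]


@dataclass
class FaceTrack:
    """Hypothetical face track: one face followed from start to end (seconds)."""
    track_id: int
    start: float
    end: float


def speech_overlap(track: FaceTrack, speech: List[Interval]) -> float:
    """Fraction of the track's duration covered by A-VAD speech intervals."""
    duration = track.end - track.start
    if duration <= 0:
        return 0.0
    covered = 0.0
    for s, e in speech:
        # Length of the intersection between the track and this speech interval.
        covered += max(0.0, min(track.end, e) - max(track.start, s))
    return covered / duration


def label_track(
    track: FaceTrack,
    speech: List[Interval],
    speak_thr: float = 0.9,   # assumed threshold, not taken from the paper
    silent_thr: float = 0.1,  # assumed threshold, not taken from the paper
) -> Optional[str]:
    """Label a track 'speaking' or 'silent'; return None for ambiguous tracks."""
    ratio = speech_overlap(track, speech)
    if ratio >= speak_thr:
        return "speaking"
    if ratio <= silent_thr:
        return "silent"
    return None  # mixed speech and silence: discard rather than risk a wrong label


speech_intervals = [(0.0, 2.0), (5.0, 6.0)]
labels = [
    label_track(FaceTrack(0, 0.0, 2.0), speech_intervals),
    label_track(FaceTrack(1, 3.0, 4.0), speech_intervals),
    label_track(FaceTrack(2, 1.0, 3.0), speech_intervals),
]
```

Discarding ambiguous tracks trades dataset size for label purity, which seems a reasonable choice when annotations are produced without human review. Note that such a rule is only meaningful when the audio can be attributed to the tracked face, for example when a single face is visible.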

Full Text