Abstract

Visual voice activity detection (V-VAD) aims to identify speech activity and measure its start/end moments from visual streams of speech-related behavior. The visual cue is an important counterpart to the audio modality in noisy scenarios. Most existing methods rely on 2D image sequences, which offer limited robustness to variations in illumination and facial pose. This paper proposes a 3D V-VAD method that models lip dynamics and learns a spatiotemporal representation of the speaking lip from point cloud streams. We construct a lightweight temporal speaking dynamics model, DualDyn, which represents the dual dynamics of global lip structure and local lip context in the 4D spatiotemporal domain. Activity moment detection is formulated as frame-level state classification and refined using behavior continuity. Our method is verified on the public S3DFM dataset and achieves state-of-the-art accuracy of 22.6 ms in measuring activity moments, with higher efficiency. Ablation studies demonstrate the robustness and applicability of our method.
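The abstract does not give implementation details for the refinement step; the sketch below is only an illustration of the general idea of frame-level state classification followed by a continuity constraint and conversion to start/end moments. The median-filter smoothing, window size, threshold, and frame rate are assumptions introduced here, not the paper's method.

```python
# Illustrative sketch (not from the paper): refine per-frame speaking/non-speaking
# predictions with a temporal-continuity filter, then extract start/end moments.
# The median filter, window size, threshold, and fps are all assumed values.

import numpy as np
from scipy.ndimage import median_filter

def refine_and_segment(frame_probs, threshold=0.5, window=5, fps=30.0):
    """Refine frame-level speaking probabilities and extract activity segments.

    frame_probs : 1-D array of per-frame speaking probabilities.
    threshold   : probability cutoff for the speaking state.
    window      : odd median-filter length enforcing behavior continuity.
    fps         : frame rate used to convert frame indices to seconds.
    """
    labels = (np.asarray(frame_probs) >= threshold).astype(int)
    smooth = median_filter(labels, size=window)  # suppress isolated label flips

    segments, start = [], None
    for i, state in enumerate(smooth):
        if state and start is None:
            start = i
        elif not state and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(smooth) / fps))
    return segments

if __name__ == "__main__":
    probs = np.array([0.1, 0.2, 0.8, 0.9, 0.4, 0.9, 0.95, 0.9, 0.2, 0.1])
    print(refine_and_segment(probs))  # [(start_s, end_s), ...]
```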
