Abstract

Visual voice activity detection (V-VAD) aims to identify speech activity and measure its start/end moments from visual streams of speech-related behavior. The visual cue is an important counterpart to the audio modality in noisy scenarios. Most existing methods rely on 2D image sequences, which offer limited robustness to variations in illumination and facial pose. This paper proposes a 3D V-VAD method that models lip dynamics and learns a spatiotemporal representation of the speaking lip from point cloud streams. We construct a lightweight temporal speaking dynamics model, DualDyn, which represents the dual dynamics of global lip structure and local lip context in the 4D spatiotemporal domain. Activity moment detection is formulated as frame-level state classification and refined using behavior continuity. Our method is verified on the public S3DFM dataset and achieves state-of-the-art accuracy of 22.6 ms in measuring activity moments, with higher efficiency. Ablation studies demonstrate the robustness and applicability of our method.
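The abstract does not give implementation details for the refinement step; the sketch below is only an illustration of the general idea of frame-level state classification followed by a continuity constraint and conversion to start/end moments. The median-filter smoothing, window size, threshold, and frame rate are assumptions introduced here, not the paper's method.

```python
# Illustrative sketch (not from the paper): refine per-frame speaking/non-speaking
# predictions with a temporal-continuity filter, then extract start/end moments.
# The median filter, window size, threshold, and fps are all assumed values.

import numpy as np
from scipy.ndimage import median_filter

def refine_and_segment(frame_probs, threshold=0.5, window=5, fps=30.0):
    """Refine frame-level speaking probabilities and extract activity segments.

    frame_probs : 1-D array of per-frame speaking probabilities.
    threshold   : probability cutoff for the speaking state.
    window      : odd median-filter length enforcing behavior continuity.
    fps         : frame rate used to convert frame indices to seconds.
    """
    labels = (np.asarray(frame_probs) >= threshold).astype(int)
    smooth = median_filter(labels, size=window)  # suppress isolated label flips

    segments, start = [], None
    for i, state in enumerate(smooth):
        if state and start is None:
            start = i
        elif not state and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(smooth) / fps))
    return segments

if __name__ == "__main__":
    probs = np.array([0.1, 0.2, 0.8, 0.9, 0.4, 0.9, 0.95, 0.9, 0.2, 0.1])
    print(refine_and_segment(probs))  # [(start_s, end_s), ...]
```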
