Abstract
Visual voice activity detection (V-VAD) aims to identify speech activity and measure its start/end moments in visual streams of the speaking-related biological modality. The visual cue is an important counterpart to the audio modality in noisy scenarios. Most existing methods rely on 2D image sequences, which suffer from limited robustness to variations in illumination and facial pose. This paper proposes a 3D V-VAD method that models lip dynamics and learns a spatiotemporal representation of the speaking lips from point cloud streams. We construct a lightweight temporal speaking dynamics model, DualDyn, which represents the dual dynamics of global lip structure and local lip context in the 4D spatiotemporal domain. Activity moment detection is formulated as frame-level state classification and refined by behavior continuity. Our method is verified on the public S3DFM dataset, where it achieves a state-of-the-art accuracy of 22.6 ms in measuring activity moments with higher efficiency. Ablation studies demonstrate the robustness and applicability of our method.
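As a rough illustration of the detection formulation described above (not the paper's implementation), the sketch below assumes per-frame speaking probabilities produced by some DualDyn-style classifier, thresholds them into frame-level states, and enforces behavior continuity with a simple median filter before reading off start/end moments. The function name, the filter choice, the threshold, and the frame rate are all illustrative assumptions.

```python
# Hedged sketch: frame-level state classification refined by temporal continuity.
import numpy as np
from scipy.ndimage import median_filter

def detect_activity_moments(frame_probs, fps=30.0, threshold=0.5, min_run=5):
    """frame_probs: (T,) per-frame speaking probabilities in [0, 1].
    Returns a list of (start_ms, end_ms) speaking intervals."""
    # Frame-level state classification: speaking (1) vs. silent (0).
    states = (np.asarray(frame_probs) > threshold).astype(int)

    # Behavior-continuity refinement: a median filter suppresses isolated
    # state flips shorter than roughly min_run frames.
    states = median_filter(states, size=min_run)

    # Extract contiguous speaking runs and convert frame indices to milliseconds.
    intervals, start = [], None
    for t, s in enumerate(states):
        if s and start is None:
            start = t
        elif not s and start is not None:
            intervals.append((start * 1000.0 / fps, t * 1000.0 / fps))
            start = None
    if start is not None:
        intervals.append((start * 1000.0 / fps, len(states) * 1000.0 / fps))
    return intervals
```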