Abstract

The framework of visually guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor the visual feature extractor for informative visual guidance and to separately devise a module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, since jointly optimizing and harmonizing the various model components is challenging. By contrast, this article presents a novel approach, dubbed audio-visual predictive coding (AVPC), that tackles this task in a parameter-efficient and more effective manner. The AVPC network features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding (PC)-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks within the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while significantly reducing the model size. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
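
To illustrate the iterative predictive-coding fusion described above, the following is a minimal sketch in PyTorch. All module names, dimensions, and the specific update rule are hypothetical simplifications for illustration only; they are not taken from the AVPC implementation linked above.

import torch
import torch.nn as nn

class PCFusionSketch(nn.Module):
    """Hypothetical sketch: fuse audio and visual features by repeatedly
    predicting one modality from the other and correcting with the error."""

    def __init__(self, dim=512, num_iters=4):
        super().__init__()
        self.num_iters = num_iters
        self.predict = nn.Linear(dim, dim)   # predicts the visual feature from the current audio state
        self.correct = nn.Linear(dim, dim)   # maps the prediction error back into a state update
        self.to_mask = nn.Linear(dim, dim)   # hypothetical head producing a separation mask

    def forward(self, audio_feat, visual_feat):
        state = audio_feat
        for _ in range(self.num_iters):
            pred_error = visual_feat - self.predict(state)   # prediction error between modalities
            state = state + self.correct(pred_error)         # recursive refinement of the audio state
        return torch.sigmoid(self.to_mask(state))            # mask values in [0, 1]

# Usage sketch: features would come from the video analysis and audio networks.
audio_feat = torch.randn(8, 512)
visual_feat = torch.randn(8, 512)
mask = PCFusionSketch()(audio_feat, visual_feat)

The key design point mirrored here is that fusion happens through repeated prediction and error correction rather than a one-shot concatenation, which is what allows performance to improve progressively over iterations.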
