Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

Lili Wei, Shidi Chen, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang

doi:10.1145/3506716

Abstract

Recently, given the difficulty of collecting large-scale, explicitly annotated videos, weakly supervised video object segmentation (WSVOS) using video tags has attracted much attention. Existing WSVOS approaches follow a general two-phase pipeline: a pseudo-mask generation phase and a refinement phase. To explore the intrinsic properties and correlations buried in the video frames, most of them focus on the latter phase, introducing optical flow as temporal information to provide more supervision. However, these optical flow-based approaches are strongly affected by illumination and distortion and do not consider the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and devising a corresponding temporal information fusion strategy, we propose a unified WSVOS model that adopts a two-branch architecture with a multi-level cross-branch fusion strategy, named dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, i.e., a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), are used for extracting temporal information and generating predicted segmentation masks, respectively. To perform the cross-branch fusion between TPN and SSN, we propose a dual-attention fusion module that can be plugged into the SSN flexibly. We also propose a cross-frame coherence loss (CFCL) to achieve smooth segmentation results by exploiting the coherence of masks produced by TPN and SSN. Extensive experiments demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods on two challenging datasets, i.e., DAVIS-2016 and YouTube-Objects.
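
The abstract does not give the internals of the dual-attention fusion module or the exact form of CFCL, so the following is only a minimal illustrative sketch under stated assumptions, not the authors' implementation. It assumes that "dual attention" means a channel gate plus a spatial gate computed from the temporal-branch features, that fusion is a residual addition into the spatial-branch features, and that mask coherence is measured as a mean squared difference. All function names (`channel_attention`, `spatial_attention`, `fuse`, `coherence_loss`) are hypothetical, and feature maps are plain nested lists of shape C x H x W to keep the example dependency-free.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """One gate per channel, from global average pooling (assumed design)."""
    weights = []
    for channel in feat:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        weights.append(sigmoid(total / count))
    return weights


def spatial_attention(feat):
    """One gate per pixel, from the mean over channels (assumed design)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [
        [sigmoid(sum(feat[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]


def fuse(spatial_feat, temporal_feat):
    """Add temporal features, gated by both attentions, to spatial features."""
    cw = channel_attention(temporal_feat)
    sw = spatial_attention(temporal_feat)
    c, h, w = len(spatial_feat), len(spatial_feat[0]), len(spatial_feat[0][0])
    return [
        [
            [
                spatial_feat[k][i][j] + cw[k] * sw[i][j] * temporal_feat[k][i][j]
                for j in range(w)
            ]
            for i in range(h)
        ]
        for k in range(c)
    ]


def coherence_loss(mask_a, mask_b):
    """Mean squared difference between two soft masks (H x W, values in [0, 1]).

    Stand-in for a coherence term between the two branches' masks;
    the paper's actual CFCL may differ.
    """
    h, w = len(mask_a), len(mask_a[0])
    return sum(
        (mask_a[i][j] - mask_b[i][j]) ** 2 for i in range(h) for j in range(w)
    ) / (h * w)


if __name__ == "__main__":
    ssn_feat = [[[1.0, 2.0], [3.0, 4.0]]]   # 1 channel, 2 x 2 (toy values)
    tpn_feat = [[[0.5, 0.0], [0.0, 0.5]]]
    print(fuse(ssn_feat, tpn_feat))
    print(coherence_loss([[0.0, 1.0]], [[1.0, 1.0]]))
```

In a real network the gates would be learned (for example, small convolutional or fully connected layers before the sigmoid) and the fusion would be applied at several feature levels, which is what "multi-level cross-branch fusion" suggests; the fixed pooling used here only shows the data flow.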

Journal: ACM Transactions on Intelligent Systems and Technology
Publication Date: Mar 3, 2022
Citations: 8

