Learning Complementary Spatial-Temporal Transformer for Video Salient Object Detection

Nian Liu, Kepan Nan, Wangbo Zhao, Xiwen Yao, Junwei Han

doi:10.1109/tnnls.2023.3243246

Abstract

Besides combining appearance and motion information, another crucial factor for video salient object detection (VSOD) is mining spatial-temporal (ST) knowledge, including complementary long-short temporal cues and global-local spatial context from neighboring frames. However, existing methods explore only some of these cues and ignore their complementarity. In this article, we propose a novel complementary ST transformer (CoSTFormer) for VSOD, which has a short-global branch and a long-local branch to aggregate complementary ST contexts. The former integrates the global context from two neighboring frames using dense pairwise attention, while the latter fuses long-term temporal information from more consecutive frames using local attention windows. In this way, we decompose the ST context into a short-global part and a long-local part and leverage the transformer to model the context relationships and learn their complementarity. To resolve the conflict between local window attention and object motion, we propose a novel flow-guided window attention (FGWA) mechanism that aligns the attention windows with object and camera movements. Furthermore, we deploy CoSTFormer on fused appearance and motion features, thus enabling the effective combination of all three VSOD factors. We also present a pseudo video generation method that synthesizes sufficient video clips from static images for training ST saliency models. Extensive experiments verify the effectiveness of our method and show that it achieves new state-of-the-art results on several benchmark datasets.
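The abstract does not say how FGWA aligns windows, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a dense optical flow field is available and that a window is moved by the rounded mean flow of the pixels it covers; the function name `shifted_window`, the `Flow` type, and the mean-flow rule are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Assumed layout: flow[y][x] = (dx, dy), the displacement in pixels from
# the current frame to a neighboring frame. Hypothetical, for illustration.
Flow = List[List[Tuple[float, float]]]


def shifted_window(flow: Flow, top: int, left: int, size: int) -> Tuple[int, int]:
    """Return the (top, left) corner of a size x size window in the
    neighboring frame, moved by the mean flow inside the source window."""
    height, width = len(flow), len(flow[0])

    # Average the flow vectors of all pixels covered by the source window.
    sum_dx = 0.0
    sum_dy = 0.0
    for y in range(top, top + size):
        for x in range(left, left + size):
            dx, dy = flow[y][x]
            sum_dx += dx
            sum_dy += dy
    count = size * size

    # Move the window by the rounded mean displacement.
    new_left = left + round(sum_dx / count)
    new_top = top + round(sum_dy / count)

    # Clamp so the shifted window stays fully inside the frame.
    new_top = max(0, min(height - size, new_top))
    new_left = max(0, min(width - size, new_left))
    return new_top, new_left


# Toy example: an 8 x 8 frame where everything moves 2 px right, 1 px down.
uniform_flow: Flow = [[(2.0, 1.0) for _ in range(8)] for _ in range(8)]
print(shifted_window(uniform_flow, top=2, left=2, size=3))
```

With a fixed window, an object that moves out of the window between frames can no longer be attended to; shifting the window by the estimated motion keeps the same content inside it. A real model would attend over feature tokens inside the shifted window rather than return its coordinates.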


Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication Date: Aug 1, 2024
Citations: 7

Similar Papers

A Novel Long-Term Iterative Mining Scheme for Video Salient Object Detection
Chenglizhao Chen ... Hengsen Wang
IEEE Transactions on Circuits and Systems for Video Technology | VOL. 32
01 Nov 2022

Attention Embedded Spatio-Temporal Network for Video Salient Object Detection
Lili Huang ... Liang Lin
IEEE Access | VOL. 7
01 Jan 2019

Collaborative spatial-temporal video salient object detection with cross attention transformer
Yuting Su ... Peiguang Jing
Signal Processing | VOL. 224
09 Jul 2024

RETRACTED ARTICLE: IoT-based 3D convolution for video salient object detection
Shizhou Dong ... Heye Zhang
Neural Computing and Applications | VOL. 32
02 Jan 2019
