Decoupling Multimodal Transformers for Referring Video Object Segmentation

Mingqi Gao, Feng Zheng, Jungong Han, Giovanni Montana, Jinyu Yang, Ke Lu

doi:10.1109/tcsvt.2023.3284979

Abstract

Referring Video Object Segmentation (RVOS) aims to segment the object described by a text expression from video sequences. With excellent capabilities in long-range modelling and information interaction, transformers have been increasingly applied in existing RVOS architectures. To better leverage multimodal data, most efforts focus on the interaction between visual and textual features. However, they ignore the syntactic structure of the text during the interaction: all textual components are intertwined, resulting in ambiguous vision-language alignment. In this paper, we improve the multimodal interaction by decoupling these intertwined components. Specifically, we train a lightweight subject perceptron, which extracts the subject part from the input text. Then, the subject and text features are fed into two parallel branches to interact with visual features. This enables us to perform subject-aware and context-aware interactions, respectively, thus encouraging more explicit and discriminative feature embedding and alignment. Moreover, we find that the decoupled architecture also facilitates incorporating vision-language pre-trained alignment into RVOS, further improving the segmentation performance. Experimental results on all RVOS benchmark datasets demonstrate the superiority of our proposed method over state-of-the-art methods. The code of our method is available at: https://github.com/gaomingqi/dmformer.
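The two-branch interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the single-head attention without learned projections, the 0/1 `subject_mask` standing in for the subject perceptron's output, the concatenation used to fuse the branches, and all function names are assumptions made here for illustration only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys):
    # Single-head scaled dot-product attention with no learned projections:
    # each query is replaced by a weighted average of the key vectors.
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return out

def decoupled_interaction(visual, text, subject_mask):
    # visual: list of visual token vectors; text: list of word vectors.
    # subject_mask: hypothetical 0/1 flag per word marking the subject part.
    subject = [t for t, m in zip(text, subject_mask) if m]
    subject_aware = cross_attention(visual, subject)  # branch 1: subject words only
    context_aware = cross_attention(visual, text)     # branch 2: the whole sentence
    # Assumed fusion: concatenate the two branch outputs for each visual token.
    return [s + c for s, c in zip(subject_aware, context_aware)]

# Toy example: two visual tokens, three word vectors, first word is the subject.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
subject_mask = [1, 0, 0]
fused = decoupled_interaction(visual, text, subject_mask)
```

In this toy setup the subject-aware half of each fused vector depends only on the word flagged as the subject, while the context-aware half mixes all words, which is the separation the abstract argues makes the alignment less ambiguous.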

Journal: IEEE Transactions on Circuits and Systems for Video Technology: A Publication of the Circuits and Systems Society
Publication Date: Sep 1, 2023
Citations: 1
