Abstract

Video-based referring expression comprehension is a challenging task that requires locating the referred object in every frame of a given video. Many existing approaches treat the task as object tracking, so their performance depends heavily on the quality of the tracking templates, and tracking may fail when there is insufficient annotation data to guide template selection. Other approaches are based on object detection, but they often use only a single frame adjacent to the key frame for feature learning, which limits their ability to model relationships across frames. In addition, how to better fuse features from multiple frames with the referring expression so as to locate the referent effectively remains an open problem. To address these issues, we propose the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), built on a one-stage object detector. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning over temporally adjacent frames. We further propose an Image-Language Cross-Generative Fusion module as the core of multi-stage learning: it generates cross-modal features by computing the similarity between the video and the expression, and then refines and fuses the generated features. To further strengthen the model's cross-modal feature generation, we introduce a consistency loss that constrains the image-language and language-image similarity matrices during feature generation. We evaluate the proposed approach on three public datasets, and comprehensive experimental results demonstrate its effectiveness.
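
As a rough illustration of the cross-generative fusion and consistency loss described above, the sketch below assumes PyTorch-style tensors; the function and variable names (cross_generative_fusion, video_feats, lang_feats) and the use of a mean-squared-error penalty and residual fusion are illustrative assumptions, not the paper's exact formulation. It computes the image-language and language-image similarity matrices, uses them to generate cross-modal features for each modality, and penalizes disagreement between the two matrices.

# Hypothetical sketch of similarity-based cross-modal feature generation
# with a consistency constraint; not the authors' implementation.
import torch
import torch.nn.functional as F

def cross_generative_fusion(video_feats, lang_feats):
    # video_feats: (N_v, D) frame/region features; lang_feats: (N_l, D) word features.
    # Image-language and language-image similarity matrices.
    sim_il = torch.matmul(video_feats, lang_feats.t())   # (N_v, N_l)
    sim_li = torch.matmul(lang_feats, video_feats.t())   # (N_l, N_v)

    # Generate cross-modal features: visual tokens attend to words and vice versa.
    lang_aware_visual = torch.matmul(F.softmax(sim_il, dim=-1), lang_feats)   # (N_v, D)
    visual_aware_lang = torch.matmul(F.softmax(sim_li, dim=-1), video_feats)  # (N_l, D)

    # Consistency loss: the two similarity matrices should agree up to transposition.
    consistency_loss = F.mse_loss(sim_il, sim_li.t())

    # Refine/fuse the generated features with the originals (simple residual fusion here).
    fused_visual = video_feats + lang_aware_visual
    fused_lang = lang_feats + visual_aware_lang
    return fused_visual, fused_lang, consistency_loss

# Example usage with random features.
v = torch.randn(49, 256)   # e.g. a 7x7 grid of frame features
l = torch.randn(12, 256)   # e.g. 12 word embeddings
fused_v, fused_l, loss = cross_generative_fusion(v, l)

In a multi-stage design of this kind, such a fusion step would typically be stacked several times, with the consistency term added to the detection loss; the exact staging and loss weighting are specified in the paper itself.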
