Abstract

As an important and challenging problem in vision-language research, referring expression comprehension (REC) generally requires a large amount of multi-grained information from both the visual and linguistic modalities to support accurate reasoning. Moreover, owing to the diversity of visual scenes and the variation of linguistic expressions, some hard examples carry far richer multi-grained information than others. How to aggregate multi-grained information across modalities and extract the abundant knowledge in hard examples is therefore crucial for REC. To address these challenges, we propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves language-to-vision localization through innovations in both network structure and learning mechanism. Concretely, we design a transformer-based multi-grained cross-modal attention that effectively exploits the multi-grained information inherent in the visual and linguistic encoders. Furthermore, considering the large variance among samples, we propose a self-paced sample-informativeness learning scheme that adaptively strengthens training on samples containing abundant multi-grained information. The proposed framework significantly outperforms state-of-the-art methods on the widely used RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame datasets, demonstrating the effectiveness of our method.
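To make the two ingredients of the abstract concrete, the sketch below illustrates (i) a generic transformer-style cross-modal attention block in which linguistic tokens attend to visual tokens, and (ii) a self-paced weighting of per-sample losses. This is a minimal illustration under assumed dimensions, module names, and weighting rule; it is not the authors' implementation, and the `informativeness` score is a hypothetical stand-in for the paper's sample-informativeness measure.

```python
# Minimal sketch (assumptions only, not the paper's actual architecture).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Linguistic tokens attend to visual tokens; blocks like this can be
    stacked or mirrored (vision-to-language) to fuse the two modalities."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, T, D) linguistic tokens; vis_feats: (B, V, D) visual tokens
        fused, _ = self.attn(query=text_feats, key=vis_feats, value=vis_feats)
        return self.norm(text_feats + fused)  # residual + normalization

def self_paced_loss(per_sample_loss, informativeness, pace):
    # Hypothetical self-paced weighting: scale each sample's loss by an
    # informativeness score in [0, 1], ramped in by a pace factor in [0, 1]
    # so that information-rich (hard) samples are emphasized more as
    # training proceeds.
    weights = 1.0 + pace * informativeness
    return (weights * per_sample_loss).mean()
```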
