Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point

Peizhi Zhao, Dongsheng Xu, Shiyi Zheng, Pijian Li, Qingbao Huang, Wenye Zhao, Yi Cai

doi:10.1609/aaai.v38i7.28580

Abstract

As a fundamental and challenging task in the vision and language domain, Referring Expression Comprehension (REC) has shown impressive improvements recently. However, because the task couples the comprehension of abstract concepts with the localization of concrete instances, one-stage approaches are bottlenecked by computing and data resources. To obtain a low-cost solution, the prevailing two-stage approaches decouple REC into localization (region proposal) and comprehension (region-expression matching) at the region level, but a solution based on isolated regions cannot sufficiently utilize the context and is usually limited by the quality of the proposals. Therefore, it is necessary to rebuild an efficient two-stage solution. In this paper, we propose a point-based two-stage framework for REC, in which the two stages are redefined as point-based cross-modal comprehension and point-based instance localization. Specifically, we reconstruct the raw bounding box and segmentation mask into center and mass scores, which serve as soft ground truth for measuring point-level cross-modal correlations. With the soft ground truth, REC can be approximated as a binary classification problem, which fundamentally avoids the impact of isolated regions on the optimization process. Remarkably, the consistent metrics between center and mass scores allow our system to directly optimize grounding and segmentation using the same architecture. Experiments on multiple benchmarks show the feasibility and potential of our point-based paradigm. Our code is available at https://github.com/VILAN-Lab/PBREC-MT.
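
The abstract does not give the formula for the center score, so the following is only an illustrative sketch of the general idea: turning a bounding box into per-point soft targets and scoring predictions with a binary cross-entropy. It assumes an FCOS-style centerness definition, which may differ from the paper's; the function names `center_scores` and `soft_bce` are hypothetical and are not taken from the authors' repository.

```python
import math


def center_scores(box, width, height):
    """Soft per-point targets for one box (x0, y0, x1, y1) on a width x height grid.

    Assumption: FCOS-style centerness, not necessarily the paper's definition.
    Points outside the box score 0, the box center scores 1.
    """
    x0, y0, x1, y1 = box
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must have positive width and height")
    scores = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Distances from the point to the left/right/top/bottom sides.
            left, right = x - x0, x1 - x
            top, bottom = y - y0, y1 - y
            if min(left, right, top, bottom) < 0:
                continue  # point lies outside the box
            ratio_x = min(left, right) / max(left, right)
            ratio_y = min(top, bottom) / max(top, bottom)
            scores[y][x] = math.sqrt(ratio_x * ratio_y)
    return scores


def soft_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and soft targets."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


# Usage: a 5x5 grid with a box spanning the whole grid.
target = center_scores((0, 0, 4, 4), width=5, height=5)
uniform = [[0.5] * 5 for _ in range(5)]
loss_uniform = soft_bce(uniform, target)
loss_matched = soft_bce(target, target)
```

Because every point receives its own target in [0, 1], the loss is computed over the whole grid rather than over a set of region proposals, which is the sense in which the task reduces to per-point binary classification in this sketch.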

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 2

Similar Papers

A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension
Yue Liao ... Chen Qian
01 Jun 2020

A Fast and Accurate One-Stage Approach to Visual Grounding
Zhengyuan Yang ... Jiebo Luo
01 Oct 2019

Bottom-Up and Bidirectional Alignment for Referring Expression Comprehension
Liuwu Li ... Yi Cai
17 Oct 2021

Understanding Synonymous Referring Expressions via Contrastive Features
Yi-Wen Chen ... Ming-Hsuan Yang
International Journal of Computer Vision | VOL. 130
09 Aug 2022
