DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Shilong Liu, Hao Zhang, Jun Zhu, Shijia Huang, Yaoyuan Liang, Feng Li, Hang Su, Lei Zhang

doi:10.1609/aaai.v37i2.25261

Abstract

In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from the text and locate objects in the image simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Compared with a single-query design, this effectively alleviates the difficulty of modality alignment between image and text and enables the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate performance on PEG, we also propose a new metric, CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. Our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves recall rates of 91.04% and 83.51% on RefCOCO testA and testB, respectively.
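The view of phrase extraction as a 1D text segmentation problem can be illustrated with a minimal Python sketch. This is not the authors' implementation: the caption, the whitespace tokenization, the helper name `phrase_mask`, and the half-open token spans are all assumptions chosen for illustration. Each phrase is represented as a binary mask over the caption's tokens, the 1D counterpart of a segmentation mask over pixels.

```python
from typing import Dict, List, Tuple


def phrase_mask(num_tokens: int, span: Tuple[int, int]) -> List[int]:
    """Return a binary mask over tokens for a half-open span [start, end)."""
    start, end = span
    if not (0 <= start < end <= num_tokens):
        raise ValueError("span must satisfy 0 <= start < end <= num_tokens")
    return [1 if start <= i < end else 0 for i in range(num_tokens)]


# Hypothetical caption, split on whitespace purely for illustration.
tokens: List[str] = "a man in a red shirt holds a frisbee".split()

# Hypothetical phrase annotations as token-index spans (assumed format).
spans: Dict[str, Tuple[int, int]] = {
    "a man": (0, 2),
    "a red shirt": (3, 6),
    "a frisbee": (7, 9),
}

# One 1D mask per phrase; a model in this setting would predict such masks.
masks: Dict[str, List[int]] = {
    phrase: phrase_mask(len(tokens), span) for phrase, span in spans.items()
}

for phrase, mask in masks.items():
    print(f"{phrase!r}: {mask}")
```

Reading the tokens where a mask is 1 recovers the phrase text, which is why a predicted mask is enough to identify the phrase that a box refers to.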

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 26, 2023
Citations: 4

Similar Papers

A New Multinetwork Mean Distillation Loss Function for Open-World Domain Incremental Object Detection
Jing Yang ... Kun Yuan
International Journal of Intelligent Systems | VOL. 2023
06 Nov 2023

Gaze Assisted Visual Grounding
Kritika Johari ... Christopher Tay Zi Tong
01 Jan 2020

Learning Data Augmentation Strategies for Object Detection
Barret Zoph ... Quoc V Le
01 Jan 2020

DEEP LEARNING FOR OBJECT DETECTION USING RADAR DATA
A M Reda ... A Moussa
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences | VOL. X-1/W1-2023
05 Dec 2023
