Abstract

Weakly supervised referring expression comprehension (REC) aims to ground target objects in images according to given referring expressions, while the mappings between image regions and referring expressions are unavailable during model training. Existing models typically reconstruct the multimodal relationships to ground targets by utilizing off-the-shelf information, but fail to further exploit helpful knowledge to enhance model performance. To address this issue, we propose an adaptive knowledge distillation architecture that enriches the predominant weakly supervised REC paradigm and transfers target-aware and interaction-aware knowledge from a pre-trained teacher grounder to improve the grounding performance of the student model. Specifically, to encourage the teacher to impart more reliable knowledge, we present a Knowledge Confidence-Based Adaptive Temperature (KCAT) learning approach that learns optimal temperatures to transfer the target-aware and interaction-aware knowledge with higher prediction confidence. Moreover, to encourage the student to absorb more helpful knowledge, we introduce a Student Competency-Based Adaptive Weight (SCAW) learning strategy that dynamically integrates the distilled target-aware and interaction-aware knowledge to enhance the student's grounding certainty. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg, to validate the proposed approach. Experimental results demonstrate that our approach achieves superior performance over state-of-the-art methods with the aid of adaptive knowledge distillation and integration. The code and trained models are available at: https://github.com/dami23/WREC_AdaptiveKD.
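The abstract only names the KCAT and SCAW mechanisms; the sketch below is a minimal, hypothetical illustration of the general idea, not the paper's implementation. It assumes a simple confidence-to-temperature rule (higher teacher confidence → lower temperature, so sharper targets are transferred) and a scalar student-competency weight on the distillation term; the function names and the specific mapping are our assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, numerically stabilized.
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions over candidate regions.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def confidence_temperature(teacher_logits, t_min=1.0, t_max=4.0):
    # Hypothetical KCAT-style rule: the teacher's top prediction confidence
    # interpolates the temperature; confident teachers distill at lower T.
    conf = softmax(teacher_logits).max()
    return t_max - (t_max - t_min) * conf

def distill_loss(student_logits, teacher_logits, competency_weight):
    # Hypothetical SCAW-style weighting: a student-competency scalar scales
    # the temperature-smoothed KL between teacher and student distributions.
    T = confidence_temperature(teacher_logits)
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # The T**2 factor is the standard gradient-scale correction in
    # temperature-based distillation.
    return competency_weight * (T ** 2) * kl_divergence(p_teacher, p_student)

# Toy scores over three candidate regions for one referring expression.
teacher = np.array([2.0, 0.5, -1.0])
student = np.array([1.0, 0.8, -0.2])
loss = distill_loss(student, teacher, competency_weight=0.7)
```

In a full system there would be two such terms, one for target-aware and one for interaction-aware knowledge, each with its own learned temperature and adaptively integrated weight.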
