Abstract

Referring Image Segmentation (RIS) aims to extract the object or stuff from an image according to a given natural language expression. As a representative multi-modal reasoning task, the main challenge of RIS lies in accurately understanding and aligning two types of heterogeneous data (i.e., image and text). Existing methods typically address this task via implicit cross-modal fusion of visual and linguistic features that are extracted separately by different encoders; owing to their distinct latent representation structures, such features struggle to capture accurate image–text alignment information. In this paper, we propose a Dual-Graph Hierarchical Interaction Network (DGHIN) to facilitate explicit and comprehensive alignment between image and text data. First, two graphs are built separately for the initial visual and linguistic features extracted with different pre-trained encoders. By means of graph reasoning, we obtain a unified representation structure for the two modalities that captures intra-modal entities and their contexts, where each projected node incorporates long-range dependencies into the latent representation. Then, a Hierarchical Interaction Module (HIM) is applied to the visual and linguistic graphs to extract comprehensive inter-modal correlations at the entity level and the graph level, which not only captures correspondences between keywords and visual patches but also draws the whole sentence closer to the image region with consistent context in the latent space. Extensive experiments on RefCOCO, RefCOCO+, G-Ref, and ReferIt demonstrate that the proposed DGHIN outperforms many state-of-the-art methods. Code is available at https://github.com/ZhaofengSHI/referring-DGHIN.
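
The following is a minimal PyTorch-style sketch of the two ideas summarized above: per-modality graph reasoning that gives visual and linguistic features a shared node-based representation, and a hierarchical interaction that combines entity-level cross-attention with a graph-level global alignment. It is not the authors' implementation; all module names, dimensions, the single-step graph propagation, and the mean-pooled graph-level fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphReasoning(nn.Module):
    """Project modality features into graph nodes and propagate context over a
    learned adjacency, so each node carries long-range dependencies."""

    def __init__(self, in_dim, node_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, node_dim)
        self.gcn = nn.Linear(node_dim, node_dim)

    def forward(self, feats):                    # feats: (B, N, in_dim)
        nodes = self.proj(feats)                 # (B, N, node_dim)
        adj = torch.softmax(nodes @ nodes.transpose(1, 2), dim=-1)   # (B, N, N)
        return F.relu(self.gcn(adj @ nodes)) + nodes                 # one reasoning step


class HierarchicalInteraction(nn.Module):
    """Entity-level cross-attention between visual and linguistic nodes, plus a
    graph-level interaction between pooled global representations (hypothetical
    stand-in for the paper's HIM)."""

    def __init__(self, node_dim):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(node_dim, num_heads=4, batch_first=True)
        self.graph_fuse = nn.Linear(2 * node_dim, node_dim)

    def forward(self, vis_nodes, lang_nodes):
        # Entity level: each visual node attends to the keyword nodes.
        ent, _ = self.cross_attn(vis_nodes, lang_nodes, lang_nodes)  # (B, Nv, D)
        # Graph level: align the whole sentence with the global image context.
        g_vis = vis_nodes.mean(dim=1)                                # (B, D)
        g_lang = lang_nodes.mean(dim=1)                              # (B, D)
        gate = torch.sigmoid(self.graph_fuse(torch.cat([g_vis, g_lang], dim=-1)))
        return ent * gate.unsqueeze(1)                               # modulated visual nodes


if __name__ == "__main__":
    vis = torch.randn(2, 196, 512)    # e.g. flattened visual patch features
    lang = torch.randn(2, 20, 768)    # e.g. token features from a text encoder
    vis_graph, lang_graph = GraphReasoning(512, 256), GraphReasoning(768, 256)
    him = HierarchicalInteraction(256)
    out = him(vis_graph(vis), lang_graph(lang))
    print(out.shape)                  # torch.Size([2, 196, 256])
```

The fused visual nodes would then be reshaped back to the spatial grid and decoded into a segmentation mask; that decoding stage is omitted here.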
