Abstract

Referring image segmentation aims to segment the instance corresponding to a given language description, which requires aligning information from two modalities. Existing approaches usually align cross-modal information based on different forms of feature units, such as pixel-sentence, pixel-word, and patch-word pairs. The problem is that the semantic information carried by these feature units may be mismatched; for example, the semantics conveyed by a single pixel cover only a part of the semantics of a sentence. When such inconsistent information is used to model the relationship between feature units from the two modalities, the resulting cross-modal relationships are imprecise, leading to inaccurate cross-modal features. In this paper, we propose to generate scalable area and keyword features so that the feature units from the two modalities have comparable semantic granularity. Meanwhile, the scalable features provide sparse representations of the image and text, which reduces the computational complexity of computing cross-modal features. In addition, we design a novel multi-source-driven dynamic convolution that inversely maps the area-keyword cross-modal features back to the image to predict the mask. Experimental results on three benchmark datasets demonstrate that our proposed framework achieves advanced performance while greatly reducing the computational cost of the model.
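As a rough illustration of the dynamic-convolution idea mentioned above, the sketch below generates per-sample convolution kernels from a fused cross-modal feature and applies them to the image feature map to produce mask logits. This is a minimal Python/PyTorch sketch, not the paper's actual multi-source design; all names (`CrossModalDynamicHead`, `fused_dim`, etc.) are hypothetical.

```python
# Illustrative sketch only: dynamic convolution whose kernels are generated
# from a fused area-keyword feature and applied to image features to predict
# a mask. Names and dimensions are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDynamicHead(nn.Module):
    def __init__(self, img_dim: int, fused_dim: int, kernel_size: int = 1):
        super().__init__()
        self.kernel_size = kernel_size
        # Generate one per-sample kernel (single output channel -> mask logits)
        # from the pooled cross-modal representation.
        self.kernel_gen = nn.Linear(fused_dim, img_dim * kernel_size * kernel_size)

    def forward(self, img_feat: torch.Tensor, fused_feat: torch.Tensor) -> torch.Tensor:
        # img_feat:   (B, C, H, W) visual feature map
        # fused_feat: (B, D)       pooled area-keyword cross-modal feature
        B, C, H, W = img_feat.shape
        k = self.kernel_size
        weight = self.kernel_gen(fused_feat).view(B, C, k, k)
        # Grouped-convolution trick: fold the batch into channels so that each
        # sample is convolved with its own dynamically generated kernel.
        out = F.conv2d(
            img_feat.reshape(1, B * C, H, W),
            weight,
            groups=B,
            padding=k // 2,
        )
        return out.view(B, 1, H, W)  # per-pixel mask logits

# Usage (hypothetical dimensions):
# head = CrossModalDynamicHead(img_dim=256, fused_dim=512)
# mask_logits = head(img_feat, fused_feat)
```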
