Abstract

Referring expression comprehension and segmentation aim to locate and segment a referred instance in an image according to a natural language expression. However, existing methods tend to ignore the interaction between visual and language modalities for visual feature learning, and establishing a synergy between the visual and language modalities remains a considerable challenge. To tackle the above problems, we propose a novel end-to-end framework, Cross-Modality Synergy Network (CMS-Net), to address the two tasks jointly. In this work, we propose an attention-aware representation learning module to learn modal representations for both images and expressions. A language self-attention submodule is proposed in this module to learn expression representations by leveraging the intra-modality relations, and a language-guided channel-spatial attention submodule is introduced to obtain the language-aware visual representations under language guidance, which helps the model pay more attention to the referent-relevant regions in the images and relieve background interference. Then, we design a cross-modality synergy module to establish the inter-modality relations for modality fusion. Specifically, a language-visual similarity is obtained at each position of the visual feature map, and the synergy is achieved between the two modalities in both semantic and spatial dimensions. Furthermore, we propose a multi-scale feature fusion module with a selective strategy to aggregate the important information from multi-scale features, yielding target results. We conduct extensive experiments on four challenging benchmarks, and our framework achieves significant performance gains over state-of-the-art methods.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.