Abstract

The rapid development of information technology has led to explosive growth of visual and textual content, which poses substantial challenges for learning correlations and performing cross-media retrieval between images and sentences. Existing methods mainly explore cross-media correlation either at the global level, treating whole images and sentences as instances, or at the local level, using fine-grained patches such as discriminative image regions and key words; both ignore the complementary information carried by the relations between local fine-grained patches. Relation understanding is essential for learning cross-media correlation: people attend not only to the alignment between discriminative image regions and key words, but also to their relations within the visual and textual context. Therefore, in this paper, we propose the Multi-level Adaptive Visual-textual Alignment (MAVA) approach with the following contributions. First, we propose a cross-media multi-pathway fine-grained network that extracts not only local fine-grained patches, namely discriminative image regions and key words, but also visual relations between image regions and textual relations from the context of sentences, which provide complementary information for exploiting fine-grained characteristics within each media type. Second, we propose a visual-textual bi-attention mechanism that distinguishes fine-grained information of different saliency at both the local and relation levels, providing more discriminative cues for correlation learning. Third, we propose cross-media multi-level adaptive alignment, which explores global, local, and relation alignments. An adaptive alignment strategy further enhances the matched pairs across media types and adaptively discards misalignments, so that more precise cross-media correlation can be learned. Extensive image-sentence matching experiments on two widely used cross-media datasets, Flickr-30K and MS-COCO, comparing against 10 state-of-the-art methods, verify the effectiveness of the proposed MAVA approach.
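To make the adaptive alignment idea more concrete, the short Python/PyTorch sketch below shows one minimal way a local-level adaptive alignment score could be computed between image regions and sentence words. It is an illustrative assumption, not MAVA's actual formulation: the function name adaptive_local_alignment, the threshold tau, and the simple one-directional attention are hypothetical choices introduced only for this example.

    import torch
    import torch.nn.functional as F

    def adaptive_local_alignment(regions, words, tau=0.0):
        """Illustrative sketch (not the paper's exact method): match image
        regions to words by cosine similarity, attend over words, and
        adaptively discard regions whose best match looks like a misalignment.

        regions: (n_regions, d) image-region embeddings
        words:   (n_words, d)  word embeddings
        tau:     hypothetical threshold below which a region is treated
                 as misaligned with the sentence
        """
        r = F.normalize(regions, dim=-1)
        w = F.normalize(words, dim=-1)
        sim = r @ w.t()                      # (n_regions, n_words) pairwise similarities

        # Attend over words for each region (one-directional attention for brevity;
        # a bi-attention variant would also attend over regions for each word).
        attn = F.softmax(sim, dim=-1)
        matched = (attn * sim).sum(dim=-1)   # per-region alignment score

        # Adaptive step: drop regions whose best match falls below tau, so
        # misaligned regions do not dilute the overall image-sentence score.
        keep = (sim.max(dim=-1).values > tau).float()
        return (matched * keep).sum() / keep.sum().clamp(min=1.0)

    # Usage: score = adaptive_local_alignment(torch.randn(36, 512), torch.randn(12, 512))

In MAVA itself, such local scores would be combined with global- and relation-level alignments; the sketch only illustrates the "enhance matches, discard misalignments" intuition at a single level.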
