Abstract

Many methods focus on aligning image regions with corresponding text fragments, ignoring the fact that images contain fragments that cannot be expressed in text. To fully express the information in images and to prevent the performance degradation caused by fine-grained information that deviates from an image's core meaning, we propose the Semantic-embedding Guided Graph Network (SGGN) for cross-modal retrieval. SGGN learns detailed representations of each modality under the guidance of integrated semantic information, steering local fragments to capture the internal correlations of cross-modal data and to convey that information effectively. To further bridge the semantic gap between modalities, SGGN trains an adversarial network that plays a min-max game between the two modalities, and a graph aggregation network that absorbs complementary information from neighboring samples. We evaluate our approach on two datasets. In terms of R@10, our method achieves 97.2% on Flickr30k; on MS-COCO, it reaches 99.2% on the 1K test set and 92.0% on the 5K test set.
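The abstract names two auxiliary components: an adversarial game that makes image and text embeddings indistinguishable, and a graph aggregation step that lets each sample absorb complementary information from its neighbors. The sketch below is only an illustration of those two ideas under assumed module names (NeighborAggregation, ModalityDiscriminator) and an assumed embedding dimension; it is not the authors' SGGN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeighborAggregation(nn.Module):
    """Illustrative neighbor aggregation: each embedding in a batch absorbs
    information from its k most similar neighbors (hypothetical design)."""

    def __init__(self, dim, k=5):
        super().__init__()
        self.k = k
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (N, dim) batch of modality embeddings
        x_norm = F.normalize(x, dim=1)
        sim = x_norm @ x_norm.t()              # cosine similarities, (N, N)
        sim.fill_diagonal_(-1.0)               # exclude self from neighbors
        topk = sim.topk(self.k, dim=1)
        weights = F.softmax(topk.values, dim=1)        # (N, k) attention over neighbors
        neighbors = x[topk.indices]                    # (N, k, dim)
        agg = (weights.unsqueeze(-1) * neighbors).sum(dim=1)   # weighted neighbor sum
        return F.relu(self.fuse(torch.cat([x, agg], dim=1)))   # fuse self + neighbors


class ModalityDiscriminator(nn.Module):
    """Illustrative adversarial discriminator: predicts whether an embedding
    came from the image or the text encoder; the encoders learn to fool it."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, x):
        return self.net(x)  # logit: image vs. text


# Usage sketch: aggregate a batch of (assumed 256-d) image embeddings,
# then score them with the modality discriminator.
img_emb = torch.randn(32, 256)
agg_emb = NeighborAggregation(256)(img_emb)
modality_logits = ModalityDiscriminator(256)(agg_emb)
```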
