Abstract
Many methods focus on aligning image regions with their corresponding text fragments, ignoring that images contain fragments that cannot be expressed in text. To fully express the information in images and to prevent the performance degradation caused by fine-grained information deviating from an image's core meaning, we propose the Semantic-embedding Guided Graph Network (SGGN) for cross-modal retrieval. SGGN learns detailed representations of each modality with integrated semantic information, guiding local fragments to capture the internal correlations of cross-modal data and to convey that information effectively. To further bridge the semantic gap between modalities, SGGN uses an adversarial network to play a min-max game and a graph aggregation network to absorb complementary information from neighboring samples. We evaluate our approach on two datasets. In terms of R@10, our method achieves 97.2% on the Flickr30k dataset; on MS-COCO, it reaches 99.2% on the 1K test set and 92.0% on the 5K test set.
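The abstract does not give implementation details of the graph aggregation step; as a rough illustration of the general idea of absorbing complementary information from neighboring samples, the following NumPy sketch (the function name, the choice of k, and the 0.5 blending weight are hypothetical, not taken from the paper) refines each embedding with an attention-weighted mean of its nearest neighbors.

```python
import numpy as np

def aggregate_neighbors(embeddings, k=5):
    """Refine each sample's embedding by blending in information from its
    k most similar samples (cosine similarity); a simplified stand-in for
    the neighbor aggregation described in the abstract, not the paper's model."""
    # L2-normalize so dot products equal cosine similarities.
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)            # exclude each sample from its own neighborhood
    refined = np.empty_like(embeddings)
    for i, row in enumerate(sim):
        nbrs = np.argpartition(row, -k)[-k:]  # indices of the k most similar samples
        w = np.exp(row[nbrs] - row[nbrs].max())
        w /= w.sum()                          # softmax attention over the neighbors
        # Blend the original embedding with the weighted neighbor mean (0.5/0.5 is arbitrary here).
        refined[i] = 0.5 * embeddings[i] + 0.5 * (w @ embeddings[nbrs])
    return refined

# Example: 8 embeddings of dimension 16.
emb = np.random.default_rng(0).standard_normal((8, 16))
print(aggregate_neighbors(emb, k=3).shape)    # (8, 16)
```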