Abstract

Semantic relationship information is important to the image-text retrieval task. Existing work typically extracts relationship information by computing pairwise relationship values, which rarely captures meaningful semantic relationships. A more principled approach is to convert each modality into a scene graph, thereby modeling relationships explicitly. A scene graph is a graph data structure that models the scene of a modality and contains two kinds of concepts: objects and relationships. In the image modality, objects correspond to image regions and relationships represent predicates between regions. In the text modality, objects correspond to entities and relationships represent associations between entities, also known as semantic relationships. Both objects and relationships matter for image-text retrieval, and a key challenge is obtaining this semantic information. In this paper, images and texts are represented as two kinds of scene graphs, a visual scene graph and a textual scene graph, which are then combined into a Heterogeneous Scene Graph (HSG). By explicitly modeling relationships with a directed graph, information can be propagated along edges. To further extract semantic information, we introduce metapaths, which extract specific semantic information along specified paths. Moreover, we propose Heterogeneous Message Passing (HMP) to propagate information along metapaths. After message passing, the similarity between the two modalities can be computed as the similarity between the graphs. Experiments show that the model achieves competitive results on Flickr30K and MSCOCO, indicating that our approach is advantageous for image-text retrieval.
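
To make the core idea concrete, the sketch below shows a toy heterogeneous scene graph with typed object and relationship nodes, and one round of message passing restricted to a metapath. The abstract does not give the paper's actual HMP update rule, so everything here is an assumption for illustration: the node and edge type names, the mean-aggregation residual update, and the metapath encoding as (source type, edge type, destination type) patterns are all hypothetical.

```python
import numpy as np

DIM = 4  # toy feature dimension (assumption; real models use learned embeddings)

class HeteroSceneGraph:
    """Toy heterogeneous scene graph: nodes are visual objects ("v_obj"),
    textual objects ("t_obj"), or relationship nodes ("rel"); directed edges
    follow the subject -> relationship -> object structure of a scene graph."""

    def __init__(self):
        self.feats = {}   # node id -> feature vector
        self.types = {}   # node id -> node type
        self.edges = []   # list of (src, edge_type, dst)

    def add_node(self, nid, ntype, feat):
        self.types[nid] = ntype
        self.feats[nid] = np.asarray(feat, dtype=np.float32)

    def add_edge(self, src, etype, dst):
        self.edges.append((src, etype, dst))

def metapath_message_passing(g, metapath):
    """One synchronous round of mean-aggregation message passing, restricted
    to edges whose (src type, edge type, dst type) pattern occurs in
    `metapath` -- a stand-in for the paper's HMP, not its actual formulation."""
    allowed = set(metapath)
    incoming = {nid: [] for nid in g.feats}
    for src, etype, dst in g.edges:
        if (g.types[src], etype, g.types[dst]) in allowed:
            incoming[dst].append(g.feats[src])
    new_feats = {}
    for nid, msgs in incoming.items():
        if msgs:  # residual update: add the mean of neighbor messages
            new_feats[nid] = g.feats[nid] + np.mean(msgs, axis=0)
        else:
            new_feats[nid] = g.feats[nid]
    g.feats = new_feats

# Usage: "man rides horse" as a textual scene graph fragment.
g = HeteroSceneGraph()
g.add_node("man", "t_obj", np.ones(DIM))
g.add_node("rides", "rel", np.zeros(DIM))
g.add_node("horse", "t_obj", 2 * np.ones(DIM))
g.add_edge("man", "subj_of", "rides")
g.add_edge("rides", "obj_of", "horse")

# Metapath t_obj -> rel -> t_obj: propagate subject info into the relationship
# node and relationship info into the object node in one pass.
metapath = [("t_obj", "subj_of", "rel"), ("rel", "obj_of", "t_obj")]
metapath_message_passing(g, metapath)
print(g.feats["rides"], g.feats["horse"])
```

Restricting aggregation to a metapath means only semantically meaningful paths (e.g. subject-predicate-object chains) exchange information, which is the intuition behind using metapaths rather than passing messages over every edge indiscriminately.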
