Abstract

A scene graph is a graph representation that explicitly encodes high-level semantic knowledge of an image, such as objects, their attributes, and the relationships between them. Various tasks have been proposed around scene graphs, but each works with a limited vocabulary and carries bias introduced by its own hypotheses. As a result, the output of each task is not generalizable and is difficult to apply to other downstream tasks. In this paper, we propose Entity Synset Alignment (ESA), a method that creates a general scene graph by efficiently aligning diverse semantic knowledge to resolve this bias problem. ESA uses a large-scale lexical database, WordNet, together with Intersection over Union (IoU) to align object labels across multiple scene graphs. In our experiments, the integrated scene graph is applied to image-caption retrieval as a downstream task. We confirm that integrating multiple scene graphs helps obtain better representations of images.
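The alignment idea described above can be sketched in a few lines: two detections from different scene graphs are merged into one entity when their labels share a synset and their bounding boxes overlap sufficiently. This is a minimal illustration, not the paper's implementation; the synonym table below is a toy stand-in for WordNet, and the function names and 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Toy stand-in for WordNet: label -> set of synset ids.
# In practice these would come from a lexical database lookup.
SYNSETS = {
    "man":     {"person.n.01", "man.n.01"},
    "person":  {"person.n.01"},
    "bike":    {"bicycle.n.01"},
    "bicycle": {"bicycle.n.01"},
}

def same_entity(label_a, box_a, label_b, box_b, iou_thresh=0.5):
    """Align two detections if their labels share a synset AND boxes overlap."""
    shared = SYNSETS.get(label_a, set()) & SYNSETS.get(label_b, set())
    return bool(shared) and iou(box_a, box_b) >= iou_thresh
```

Under this sketch, a "man" box and a "person" box covering the same region are treated as one entity, while a "man" box and a "bike" box are kept separate even if they overlap.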

Highlights

  • Beyond detecting and recognizing individual objects, research on understanding visual scenes is moving toward extracting semantic knowledge from natural images to create scene graphs

  • We propose Entity Synset Alignment (ESA) to perform scene graph integration

  • Entity Synset Alignment (ESA) integrates scene graphs generated from each dataset


Summary

Introduction

Beyond detecting and recognizing individual objects, research on understanding visual scenes is moving toward extracting semantic knowledge from natural images to create scene graphs. Starting with (Krishna et al., 2017), various studies have proposed generating this semantic knowledge from images (Zellers et al., 2018; Xu et al., 2017; Liang et al., 2019; Anderson et al., 2018). In (Anderson et al., 2018), the authors extract both object and attribute information for each entity using 1,600 object and 400 attribute class labels. (Zellers et al., 2018; Xu et al., 2017) generate relationships between objects in an image as triplets (head entity - predicate - tail entity) using 150 object and 50 predicate class labels. In (Liang et al., 2019), the authors constructed Visually-Relevant Relationships (VrR-VG) on top of (Krishna et al., 2017) to mine more valuable relationships, with 1,600 object and 117 predicate class labels. However, the same object in an image may be labeled with different vocabulary across these sources (e.g., man vs. person).
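The triplet form used by these relationship datasets can be sketched as a tiny data structure; the class name, fields, and example labels here are illustrative, not taken from any of the cited datasets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    head: str       # head entity label, e.g. "man"
    predicate: str  # relationship label, e.g. "riding"
    tail: str       # tail entity label, e.g. "bicycle"

# A scene graph's relationship set is then just a collection of triplets.
graph = [
    Triplet("man", "riding", "bicycle"),
    Triplet("man", "wearing", "helmet"),
]
```

When two datasets label the same head entity differently ("man" vs. "person"), their triplet sets cannot be merged by string equality alone, which is the vocabulary mismatch ESA addresses.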

