Abstract

Generation of scene graphs and natural language captions from images for deep image understanding is an ongoing research problem. Scene graphs and natural language captions share a common characteristic: both are generated by considering the objects in an image and the relationships between those objects. This study proposes a deep neural network model named the Context-based Captioning and Scene Graph Generation Network (C2SGNet), which simultaneously generates scene graphs and natural language captions from images. The proposed model produces its outputs through the exchange of context information between these two tasks. For effective communication of context information, the two tasks are structured into three layers: the object detection, relationship detection, and caption generation layers. Each layer receives related context information from the layer below it. The proposed model was experimentally assessed on the Visual Genome benchmark dataset, and the performance improvement provided by the context information was verified through various experiments. Further, performance comparisons with existing models confirmed the strong performance of the proposed model.
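
To make the layered structure concrete, the sketch below outlines how context information could flow upward from object detection to relationship detection and then to caption generation. It is a minimal PyTorch-style illustration; the module names, feature dimensions, and interfaces are assumptions made for this sketch and are not taken from the authors' implementation.

```python
# Minimal sketch of the three-layer context flow described in the abstract.
# Module names, feature dimensions, and interfaces are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class C2SGNetSketch(nn.Module):
    def __init__(self, feat_dim=512, num_classes=150, num_predicates=50, vocab_size=10000):
        super().__init__()
        # Object detection layer: produces object context and class scores.
        self.object_layer = nn.Linear(feat_dim, feat_dim)
        self.object_classifier = nn.Linear(feat_dim, num_classes)
        # Relationship detection layer: consumes object context from below.
        self.relation_layer = nn.Linear(2 * feat_dim, feat_dim)
        self.predicate_classifier = nn.Linear(feat_dim, num_predicates)
        # Caption generation layer: consumes object and relationship context.
        self.caption_rnn = nn.LSTM(2 * feat_dim, feat_dim, batch_first=True)
        self.word_classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, region_feats, pair_index):
        # region_feats: (N, feat_dim) pooled features for N detected regions
        # pair_index:   (M, 2) indices of subject/object region pairs
        obj_ctx = torch.relu(self.object_layer(region_feats))   # object context
        obj_logits = self.object_classifier(obj_ctx)

        # The relationship layer receives object context from the lower layer.
        pair_feats = torch.cat([obj_ctx[pair_index[:, 0]],
                                obj_ctx[pair_index[:, 1]]], dim=-1)
        rel_ctx = torch.relu(self.relation_layer(pair_feats))   # relationship context
        rel_logits = self.predicate_classifier(rel_ctx)

        # The caption layer receives both object and relationship context.
        cap_input = torch.cat([obj_ctx.mean(0), rel_ctx.mean(0)]).view(1, 1, -1)
        cap_hidden, _ = self.caption_rnn(cap_input)
        word_logits = self.word_classifier(cap_hidden)

        return obj_logits, rel_logits, word_logits
```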

Highlights

  • Image understanding is one of the core elements of computer vision and has been extensively researched

  • This appears to be because the Relationship Context Network (RCN) delivers context information to the relationship detection process, which is a core element of scene graph generation, whereas the Caption Context Network (CCN) delivers it to the caption generation process

  • This paper proposed the C2SGNet deep neural network model, which can simultaneously generate scene graphs and natural language captions from input images for high-level image understanding
Summary

Introduction

Image understanding is one of the core elements of computer vision and has been extensively researched. Previous studies have focused on superficial information such as identifying the objects included in images and their locations. These data are insufficient for expressing image content on their own and have been used as basic modules for solving complex image understanding problems such as visual question answering (VQA) [2,3,4] and referring expression comprehension [5]. Compared to natural language sentences, which may be vague, scene graphs can clearly express the relationships among the objects that are the core elements of an image scene. For high-level image understanding, the present study proposes the Context-based Captioning and Scene Graph Generation Network (C2SGNet), a deep neural network model that simultaneously generates natural language captions and scene graphs from input images. To analyze the performance of the proposed model, various experiments are conducted using the Visual Genome benchmark dataset [15].
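
As a concrete illustration of the two output forms, a scene graph is commonly represented as a set of detected objects together with (subject, predicate, object) triples, while a caption expresses similar content as a sentence. The toy example below uses invented objects and relationships purely for illustration; it is not drawn from the paper or the Visual Genome dataset.

```python
# Illustrative example (not from the paper): a scene graph pairs detected
# objects with (subject, predicate, object) triples, while a caption
# describes the same scene in natural language.
scene_graph = {
    "objects": ["man", "horse", "hat", "field"],
    "relationships": [
        ("man", "riding", "horse"),
        ("man", "wearing", "hat"),
        ("horse", "standing in", "field"),
    ],
}
caption = "A man wearing a hat is riding a horse in a field."
```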

Related Work
Image Captioning and Scene Graph Generation Model
Performance Evaluation
Conclusion