Abstract

Change captioning aims to describe the differences between a pair of images in natural language. This under-explored task poses two main challenges: correctly describing the relative positions of objects, and overcoming disturbances caused by viewpoint changes. To address these issues, we propose a three-dimensional (3D) information aware Scene Graph based Change Captioning (SGCC) model. We extract the semantic attributes of objects and the 3D information of images (i.e., depths of objects, relative two-dimensional image-plane distances, and relative angles between objects) to construct scene graphs for image pairs, then aggregate the node representations with a graph convolutional network. By encoding the relative positions of objects in scene graphs, our model helps observers locate changed objects quickly and is, to some extent, immune to viewpoint changes. Extensive experiments show that our SGCC model achieves performance competitive with state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, verifying the effectiveness of the proposed model. Code is available at https://github.com/VISLANG-Lab/SGCC.
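The node-aggregation step mentioned above can be illustrated with a minimal graph-convolution sketch. This is not the SGCC implementation; the feature sizes, the toy adjacency matrix, and the use of a single symmetric-normalized GCN layer are all assumptions made for illustration. Each node vector stands in for an object's combined semantic and 3D features (depth, relative 2D distance, relative angle).

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    deg = A_hat.sum(axis=1)                  # node degrees
    D_inv_sqrt = np.diag(deg ** -0.5)        # symmetric normalization
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)   # ReLU activation

# Toy scene graph: 3 objects, with edges where objects are
# spatially related (all values here are illustrative).
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.random.randn(3, 8)   # 8-dim node features (assumed size)
W = np.random.randn(8, 4)   # learnable projection to 4 dims (assumed)

H_out = gcn_layer(A, H, W)  # aggregated node representations, shape (3, 4)
```

Each output row mixes an object's own features with those of its spatially related neighbors, which is what lets the model reason about relative positions when generating captions.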
