Abstract

The task of image difference captioning aims to locate changed objects in similar image pairs and describe the differences in natural language. The key challenges of this task are to sufficiently comprehend the context of the image pair and to accurately locate the changed objects in the presence of viewpoint change. Previous studies focus on pixel-level image features and neglect the rich explicit features of objects in an image pair, which are beneficial for generating fine-grained difference captions. Additionally, existing generative models struggle to locate the differences accurately under the interference of viewpoint change. To address these issues, we propose an Instance-Level Fine-Grained Difference Captioning (IFDC) model, which consists of a fine-grained feature extraction module, a multi-round feature fusion module, a similarity-based difference finding module, and a difference captioning module. To describe the changed objects comprehensively, we extract fine-grained features, i.e., visual, semantic, and positional features at the instance level, as the objects' representation. To enhance the model's robustness to viewpoint change, we design a similarity-based difference finding module that locates the changed objects accurately. Extensive experiments show that our IFDC model achieves performance comparable to state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, verifying the effectiveness of the proposed model. Our source code is available at https://github.com/VISLANG-Lab/IFDC.
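For intuition only, here is a minimal sketch of what similarity-based difference finding over instance-level features could look like. The function name, the use of cosine similarity, and the threshold value are illustrative assumptions, not the paper's actual implementation (see the repository above for that):

```python
import torch
import torch.nn.functional as F

def find_changed_instances(feats_before, feats_after, threshold=0.8):
    """Flag instances whose best cross-image match is weak.

    feats_before: (N, D) tensor of per-instance features for the first image,
    feats_after:  (M, D) tensor for the second image. Each row could be, e.g.,
    a concatenation of visual, semantic, and positional features (assumption).
    Returns boolean masks marking likely-changed instances in each image.
    """
    a = F.normalize(feats_before, dim=-1)
    b = F.normalize(feats_after, dim=-1)
    sim = a @ b.t()  # (N, M) pairwise cosine similarities
    # An instance with no sufficiently similar counterpart is treated as changed.
    changed_before = sim.max(dim=1).values < threshold
    changed_after = sim.max(dim=0).values < threshold
    return changed_before, changed_after
```

Matching on instance-level rather than pixel-level representations is what makes this kind of comparison less sensitive to viewpoint change: a shifted camera alters pixel correspondences, but a well-matched object still finds a highly similar counterpart.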
