Abstract

Automatically generating accurate and meaningful textual descriptions of images is an ongoing research challenge. Recently, considerable progress has been made by adopting multimodal deep learning approaches that integrate vision and language. However, image captioning models are most commonly developed using datasets of natural images, while few contributions have been made in the domain of artwork images. One of the main reasons is the lack of large-scale art datasets with adequate image-text pairs. Another is that generating accurate descriptions of artwork images is particularly challenging, because descriptions of artworks are more complex and can include multiple levels of interpretation. It is therefore also especially difficult to effectively evaluate generated captions of artwork images. The aim of this work is to address some of these challenges by utilizing a large-scale dataset of artwork images annotated with concepts from the Iconclass classification system. Using this dataset, a captioning model is developed by fine-tuning a transformer-based vision-language pretrained model. Due to the complex relations between image and text pairs in the domain of artwork images, the generated captions are evaluated using several quantitative and qualitative approaches. Performance is assessed using standard image captioning metrics as well as a recently introduced reference-free metric. The quality of the generated captions and the model’s capacity to generalize to new data are explored by applying the model to another art dataset and comparing the relation between commonly generated captions and the genre of artworks. The overall results suggest that the model can generate meaningful captions that indicate a stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.
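As a rough sketch of the fine-tuning step mentioned above, the example below adapts a pretrained vision-language captioning model to artwork image-text pairs. A BLIP checkpoint from Hugging Face is used purely as a stand-in, since the specific pretrained model and training setup are not detailed here; the image path and caption pair are hypothetical placeholders.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Stand-in vision-language captioning checkpoint (not necessarily the model used in this work).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical (image, caption) pairs derived from Iconclass concept annotations.
pairs = [("images/artwork_001.jpg", "the virgin mary with the christ child and angels")]

model.train()
for image_path, caption in pairs:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=caption, return_tensors="pt")
    # BLIP returns the captioning (language-modeling) loss when labels are supplied.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice the pairs would be batched with a DataLoader and trained for several epochs, but the loop above captures the basic fine-tuning objective: maximizing the likelihood of the annotation text given the image.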

Highlights

  • Image captioning refers to the task of generating a short text that describes the content of an image based only on the image input

  • The aim of this work is to address some of those challenges by utilizing a large-scale dataset of artwork images annotated with concepts from the Iconclass classification system

  • The Iconclass Caption test set contains 5192 images, but the reported CLIP-S and RefCLIP-S values are calculated only on a subset of 4928 images for which the generated captions are shorter than 76 tokens, counting the tokens that mark the beginning and end of the text sequence (see the sketch below). This filtering was carried out because the CLIP model, which serves as the basis for the CLIPScore metric, was trained with a maximum text sequence length of 76 tokens
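A minimal sketch of that length filter, assuming the Hugging Face CLIP tokenizer; the caption dictionary is a hypothetical placeholder and the exact filtering code used in the paper may differ.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
MAX_TOKENS = 76  # limit stated above, counted including the begin/end-of-text tokens

# Hypothetical mapping of test images to generated captions.
generated_captions = {"img_001.jpg": "the adoration of the shepherds with angels above"}

kept = {}
for image_id, caption in generated_captions.items():
    # encode() adds the begin- and end-of-text tokens, matching the count described above.
    if len(tokenizer.encode(caption)) < MAX_TOKENS:
        kept[image_id] = caption

print(f"{len(kept)} of {len(generated_captions)} captions retained for CLIP-S / RefCLIP-S")
```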

Summary

Introduction

Image captioning refers to the task of generating a short text that describes the content of an image based only on the image input. This usually implies recognizing objects and their relationships in an image. Image captioning in the context of natural images is usually performed at the level of “pre-iconographic” descriptions, which means describing the content and listing the objects depicted in an image. For artwork images, this type of description represents only the most basic level of visual understanding and is not considered useful for performing multimodal analysis and retrieval within art collections.
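To make the task definition concrete, the snippet below produces a short caption from the image input alone, using an off-the-shelf captioning pipeline as a generic stand-in rather than the model trained in this work; the image path is a placeholder.

```python
from transformers import pipeline

# Generic pretrained captioning model, used only to illustrate the image-to-text task.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
print(captioner("artwork.jpg"))  # e.g. [{'generated_text': 'a painting of ...'}]
```

A caption of this kind corresponds to a pre-iconographic description, listing what is visible rather than its iconographic meaning.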
