Eye-movement-prompted large image captioning model

Zheng Yang, Bing Han, Xinbo Gao, Zhi-Hui Zhan

doi:10.1016/j.patcog.2024.111097

Abstract

Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by incorporating human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore the relationships between human vision and language representation. One thousand images with rich context information were selected as the image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which embeds two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS draws on human observation patterns to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module that decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. The experimental results showed that our network was interpretable and could achieve superior results compared with state-of-the-art methods, i.e., 84.2% BLEU-4 and 145.1% CIDEr-D on the MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.
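The abstract does not specify how EMA builds or decodes its directed graph, so the following is only a rough illustration of the general idea, not the authors' method. It assumes that each fixation becomes a node, that consecutive fixations in the scan path are linked by a directed edge, and that features are mixed by simple mean aggregation. The names `scanpath_to_edges` and `propagate`, the fixation coordinates, and the two-dimensional features are all hypothetical.

```python
from typing import List, Tuple

Fixation = Tuple[float, float]  # hypothetical (x, y) fixation position in pixels
Edge = Tuple[int, int]          # directed edge (source node, destination node)


def scanpath_to_edges(fixations: List[Fixation]) -> List[Edge]:
    """Link each fixation to the next one in viewing order."""
    return [(i, i + 1) for i in range(len(fixations) - 1)]


def propagate(features: List[List[float]], edges: List[Edge]) -> List[List[float]]:
    """One round of mean aggregation along directed edges.

    Each node's new feature is the average of its own feature and the
    features of every node that has an edge pointing to it.
    """
    # Start each node with a self-loop so it keeps part of its own feature.
    incoming = {i: [i] for i in range(len(features))}
    for src, dst in edges:
        incoming[dst].append(src)

    updated = []
    for i, own in enumerate(features):
        sources = incoming[i]
        updated.append(
            [sum(features[s][d] for s in sources) / len(sources)
             for d in range(len(own))]
        )
    return updated


# Three made-up fixations, viewed in this order, with toy region features.
fixations = [(120.0, 80.0), (300.0, 150.0), (310.0, 400.0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

edges = scanpath_to_edges(fixations)
print(edges)
print(propagate(features, edges))
```

In this toy setup, information flows only forward along the scan path, so the first fixation keeps its own feature while later fixations blend in what was looked at just before them. A learned GNN would replace the plain average with trainable transformations.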

More From: Pattern Recognition

