Abstract

When looking at an image, humans shift their attention toward interesting regions, producing sequences of eye fixations. When describing an image, they also compose simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this question intuitively, we develop a visual analytics system, CapVis, to examine visual attention and image captioning, two types of subjective annotations that are relatively task-free and natural. Using these annotations, we propose a word-weighting scheme to extract visual and verbal saliency ranks and compare them against each other. In our approach, a number of low-level and semantic-level features relevant to visual-verbal saliency consistency are proposed and visualized for a better understanding of image content. Our method also reveals the different ways in which humans and computational models look at and describe images, which provides reliable information for a captioning model. Experiments also show that the visualized features can be integrated into a computational model to effectively predict the consistency between the two modalities on an image dataset carrying both types of annotations.
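As an illustration of the general idea of visual-verbal saliency consistency, the minimal sketch below compares two hypothetical saliency rankings over the same set of objects, one derived from eye fixations and one from caption word weights, using Spearman's rank correlation. The object names, scores, and the choice of Spearman's rho are illustrative assumptions, not the paper's exact word-weighting scheme or consistency measure.

```python
# Minimal sketch (assumed setup, not the paper's implementation): quantify how
# consistently two modalities rank the same objects in an image.
from scipy.stats import spearmanr

objects = ["person", "dog", "frisbee", "tree", "grass"]  # hypothetical objects

# Hypothetical visual saliency: e.g., fraction of fixation time on each object.
visual_saliency = [0.42, 0.31, 0.15, 0.08, 0.04]

# Hypothetical verbal saliency: e.g., weight of each object's word in the
# caption under some word-weighting scheme.
verbal_saliency = [0.38, 0.20, 0.30, 0.02, 0.10]

# Spearman's rho compares the two rankings; values near 1 indicate high
# visual-verbal saliency consistency, values near 0 or below indicate low
# consistency.
rho, p_value = spearmanr(visual_saliency, verbal_saliency)
print(f"visual-verbal saliency consistency (Spearman rho): {rho:.3f}")
```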
