Abstract

To understand the content of a document containing both text and pictures, an artificial agent needs to jointly recognize the entities shown in the pictures and mentioned in the text, and to link them to its background knowledge. This is a complex task, which we call Visual-Textual-Knowledge Entity Linking (VTKEL): linking visual and textual entity mentions to the corresponding entity (or a newly created one) in the agent's knowledge base. Solving the VTKEL task opens a wide range of opportunities for improving semantic visual interpretation. For instance, given the effectiveness and robustness of state-of-the-art NLP technologies for entity linking, automatically linking visual and textual mentions of the same entities to the ontology yields a large amount of automatically annotated images with fine-grained categories. In this paper, we propose the VTKEL dataset, consisting of images and corresponding captions, in which both the visual and textual mentions are annotated with the corresponding entities, typed according to the YAGO ontology. The VTKEL dataset can be used for training and evaluating algorithms for visual-textual-knowledge entity linking.
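To make the annotation structure described above concrete, the following is a minimal sketch, in Python, of what a single VTKEL-style record could look like: an image, its caption, textual mentions (character spans), visual mentions (bounding boxes), and the YAGO-typed entities they are linked to. The field names, entity URIs, and values are illustrative assumptions for exposition, not the dataset's actual schema.

```python
# Minimal sketch of one hypothetical VTKEL-style annotation record.
# Field names, entity URIs, and values are illustrative assumptions,
# not the official dataset schema.

annotation = {
    "image": "example_0001.jpg",
    "caption": "A man is playing a guitar on the street.",
    # Textual mentions: character spans in the caption, each linked to an entity.
    "textual_mentions": [
        {"span": (2, 5),   "surface": "man",    "entity": "ex:Person_1"},
        {"span": (19, 25), "surface": "guitar", "entity": "ex:Guitar_1"},
    ],
    # Visual mentions: bounding boxes (x1, y1, x2, y2) linked to the same entities.
    "visual_mentions": [
        {"bbox": (34, 20, 210, 400),   "entity": "ex:Person_1"},
        {"bbox": (120, 150, 260, 330), "entity": "ex:Guitar_1"},
    ],
    # YAGO types of the linked entities (newly created entities also receive a type).
    "entity_types": {
        "ex:Person_1": "yago:Person",
        "ex:Guitar_1": "yago:Guitar",
    },
}

# Cross-modal linking: a visual and a textual mention that resolve to the same entity.
shared_entities = (
    {m["entity"] for m in annotation["textual_mentions"]}
    & {m["entity"] for m in annotation["visual_mentions"]}
)
print(shared_entities)  # {'ex:Person_1', 'ex:Guitar_1'}
```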
