A picture is worth a thousand words, the adage reads. However, pictures cannot replace words in terms of their ability to efficiently convey clear (mostly) unambiguous and concise knowledge. Images and text, indeed reveal different and complementary information that, if combined will result in more information than the sum of that contained in a single media. The combination of visual and textual information can be obtained by linking the entities mentioned in the text with those shown in the pictures. To further integrate this with the agent’s background knowledge, an additional step is necessary. That is, either finding the entities in the agent knowledge base that correspond to those mentioned in the text or shown in the picture or, extending the knowledge base with the newly discovered entities. We call this complex task Visual-Textual-Knowledge Entity Linking (VTKEL). In this article, after providing a precise definition of the VTKEL task, we present two datasets called VTKEL1k* and VTKEL30k. These datasets consisting of images and corresponding captions, in which the image and textual mentions are both annotated with the corresponding entities typed according to the YAGO ontology. The datasets can be used for training and evaluating algorithms of the VTKEL task. Successively, we introduce a baseline algorithm called VT-LinKEr (Visual-Textual-Knowledge Entity Linker) for the solution of the VTKEL task. We evaluate the performances of VT-LinKEr on both datasets. We then contribute a supervised algorithm called ViTKan (Visual-Textual-Knowledge Alignment Network). We trained the ViTKan algorithm using features data of the VTKEL1k* dataset. The experimental results on VTKEL1k* and VTKEL30k datasets show that ViTKan substantially outperforms the baseline algorithm.
Read full abstract