Abstract

The complementary nature of visual and textual information for conveying meaning is widely recognized, for example, in entertainment, news, advertisements, science, and education. While the complex interplay of image and text in forming semantic meaning has been studied in linguistics and communication sciences for several decades, computer vision and multimedia research has largely remained at the surface of the problem. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation to model complex image-text relations. In this paper, we motivate the necessity of an additional metric called Status in order to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes based on three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Furthermore, we present a deep learning system to automatically predict each of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, with the direct predict-all approach outperforming the cascaded approach built from the individual metric classifiers.
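To illustrate the difference between the two prediction strategies, the following minimal Python sketch (our illustration, not the paper's implementation) binarizes the score of each metric classifier and maps the resulting triple to one of the 2³ = 8 classes, while the predict-all alternative simply takes an argmax over eight class probabilities. The threshold, score ranges, and class indexing are hypothetical assumptions.

```python
# Hypothetical sketch: cascading three binarized metric classifiers
# (CMI, Semantic Correlation, Status) versus one eight-way classifier.
# Thresholds, score ranges, and class indices are illustrative assumptions,
# not the paper's definitions.

from itertools import product

# Enumerate the 2^3 = 8 classes as triples over the three dimensions.
CLASSES = {bits: idx for idx, bits in enumerate(product([0, 1], repeat=3))}

def cascaded_predict(cmi_score: float, sc_score: float, status_score: float,
                     threshold: float = 0.5) -> int:
    """Binarize each metric classifier's score, then look up the class."""
    bits = tuple(int(s >= threshold) for s in (cmi_score, sc_score, status_score))
    return CLASSES[bits]

def direct_predict(class_probs: list[float]) -> int:
    """Predict-all: argmax over the eight class probabilities."""
    return max(range(len(class_probs)), key=lambda i: class_probs[i])

# Example: all three metric scores above threshold vs. a confident
# eight-way prediction for the last class.
print(cascaded_predict(0.9, 0.7, 0.6))       # -> 7, class index for (1, 1, 1)
print(direct_predict([0.05] * 7 + [0.65]))   # -> 7
```

Note that in the cascade a single misclassified metric flips the final class, which is one plausible reason why a direct eight-way classifier can outperform the cascaded combination, as the reported experiments suggest.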

Highlights

  • In our digitized world, we are faced with multimodal information on a daily basis in various situations: consumption of news, entertainment, everyday learning or learning in formal education, social media, advertisements, etc.

  • The metrics are based on the assumption that visual and textual information can relate to each other a) based on their depicted or mentioned content, or b) based on their semantic context. We follow this paradigm and present the following contributions: First, we extend this set of two metrics by introducing a third metric called “Status,” which is based on insights from linguistics and communication sciences.

  • We have presented a contribution toward bridging the semantic gap between visual and textual information, as well as the gap between research in linguistics and communication sciences on the one side, and multimedia and computer vision research on the other.


Introduction

We are faced with multimodal information on a daily basis in various situations: consumption of news, entertainment, everyday learning or learning in formal education, social media, advertisements, etc. In this context, bridging the semantic gap has been identified as one of the key challenges in image retrieval (and multimedia) research [44]; it is defined as “the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation.” At that time, one challenge was that information extraction from images was limited to low-level features.
