Abstract
Multimedia feature graphs are employed to represent features of images, video, audio, or text, and various techniques exist to extract such features from multimedia objects. In this paper, we describe the extension of such a feature graph to represent the meaning of multimedia features and introduce a formal context-free PS-grammar (Phrase Structure grammar) to automatically generate human-understandable natural language expressions from these features. To achieve this, we define a semantic extension to syntactic multimedia feature graphs and introduce a set of production rules for phrases of English natural language expressions. This explainability, which is founded on a semantic model, provides the opportunity to represent any multimedia feature in a human-readable and human-understandable form, largely closing the gap between the technical representation of such features and their semantics. We show how this explainability can be formally defined and demonstrate the corresponding implementation based on our generic multimedia analysis framework. Furthermore, we show how this semantic extension can be employed to increase effectiveness in precision and recall experiments.
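To make the grammar-based generation concrete, the following is a minimal sketch, not the authors' implementation: a toy context-free PS-grammar (S -> NP VP, NP -> Det N, VP -> V PP, PP -> P NP) that derives an English sentence from one hypothetical feature-graph edge. The node schema (subject, relation, object, confidence) and all labels are illustrative assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass

@dataclass
class FeatureNode:
    subject: str       # detected object label (hypothetical schema)
    relation: str      # spatial relation to a second object
    obj: str           # related object label
    confidence: float  # detector confidence in [0, 1]

# One function per non-terminal of the toy PS-grammar.
def det(_noun):        return "a"
def np(noun):          return f"{det(noun)} {noun}"          # NP -> Det N
def pp(rel, noun):     return f"{rel} {np(noun)}"            # PP -> P NP
def vp(node):          return f"is {pp(node.relation, node.obj)}"  # VP -> V PP

def explain(node: FeatureNode) -> str:
    """Derive S => NP VP, hedging the copula for low-confidence features."""
    hedge = "probably " if node.confidence < 0.8 else ""
    return f"{np(node.subject).capitalize()} is {hedge}{pp(node.relation, node.obj)}."

if __name__ == "__main__":
    node = FeatureNode("dog", "next to", "bicycle", 0.72)
    print(explain(node))  # -> "A dog is probably next to a bicycle."
```

The grammar is context-free in the required sense: each rule rewrites a single non-terminal independently of its surroundings, so every feature edge yields a well-formed phrase by the same derivation.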
Highlights
Bridging the semantic gap has been a research goal for many years
We present a solution for the automated explainability of Multimedia Information Retrieval (MMIR) processing steps in the form of human-understandable natural language texts, based on semantic modeling that supports inferencing and reasoning
This combination leads to a well-defined semantic representation of Multimedia Feature Graphs (MMFGs), the Semantic Multimedia Feature Graph (SMMFG), and to the Explainable Semantic Multimedia Feature Graph (ESMMFG)
Summary
Bridging the semantic gap has been a research goal for many years. Narrowing the gap between detected features of multimedia assets (i.e., images, video, audio, text, and social media) and their semantic representation has led to numerous investigations and research in the field of Multimedia Information Retrieval (MMIR) [1]. Due to the higher resolutions of images and video, the Level-Of-Detail (LOD) in multimedia assets has increased significantly. Current professional cameras such as the Sony α7R IV are equipped with a resolution of 61.0 megapixels [3], and smartphones such as the Xiaomi Redmi Note 10 Pro push that boundary even further to 108 megapixels [4]. This high LOD is mirrored in other multimedia types, e.g., text, where news agencies maintain huge archives of textual information enriched by user comments, web information, or social media [5]. We present a solution for the automated explainability of MMIR processing steps in the form of human-understandable natural language texts, based on semantic modeling that supports inferencing and reasoning.
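As a rough illustration of the inferencing and reasoning such a semantic model enables, here is a minimal sketch under stated assumptions: detected features are stored as subject-predicate-object triples, and a single transitive "is_a" rule is applied by forward chaining. The triple vocabulary and labels are hypothetical, not the paper's schema.

```python
def infer_is_a(triples):
    """Forward-chain the transitive closure of 'is_a' edges."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (s1, p1, o1) in list(inferred):
            for (s2, p2, o2) in list(inferred):
                new = (s1, "is_a", o2)
                if p1 == p2 == "is_a" and o1 == s2 and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred

triples = {
    ("region_17", "depicts", "dog"),  # hypothetical detected feature
    ("dog", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
}
closure = infer_is_a(triples)
assert ("dog", "is_a", "animal") in closure  # derived fact, not asserted
```

A query for images depicting an "animal" can then match "region_17" via the derived triple, which is one way such reasoning can improve recall in retrieval experiments.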