Abstract

We propose an algorithm for generating graphical summaries of longer text passages using a set of illustrative pictures (TIPS). TIPS combines the results of individual “weak” algorithms through a voting process. The proposed method includes a summarising algorithm that generates a digest of the input document. Each sentence of the summary is then processed separately: a sentence transformer performs text embedding, and a group of CLIP-based similarity algorithms, trained on different image embeddings, computes semantic distances between the input text and the images in the illustration database. A voting process then selects the images that best match the text. The TIPS algorithm integrates the best (highest-scored) results of the different recommendation algorithms while diminishing the influence of images that appear in the recommendations of only some of the component algorithms. TIPS returns a set of illustrative images describing each sentence of the text summary. In an evaluation by three human judges, TIPS increased the share of images rated highly relevant to the text by 5% to 8%, and of images rated relevant by 3% to 7%, compared with an approach based on a single embedding scheme.
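To make the pipeline concrete, the sketch below shows one plausible reading of the voting stage: several “weak” scorers rank candidate images against a summary sentence by cosine similarity, and a Borda-style vote merges the rankings, so images that appear in only a few individual rankings accumulate little score. The embedding matrices, the top-k cutoff, and the Borda scoring rule are illustrative assumptions, not the paper’s exact formulation.

```python
# Hypothetical sketch of a TIPS-style voting stage (not the authors' exact rule).
import numpy as np

def cosine_similarity(text_vec, image_matrix):
    """Cosine similarity between one text embedding and each image embedding."""
    text_vec = text_vec / np.linalg.norm(text_vec)
    image_matrix = image_matrix / np.linalg.norm(image_matrix, axis=1, keepdims=True)
    return image_matrix @ text_vec

def vote(rankings, n_images, top_k=5):
    """Borda-style aggregation: each scorer awards points only to its top_k images,
    so images missing from most rankings end up with low total scores."""
    scores = np.zeros(n_images)
    for ranking in rankings:
        for points, image_id in zip(range(top_k, 0, -1), ranking[:top_k]):
            scores[image_id] += points
    return np.argsort(scores)[::-1]

# Toy data: one sentence embedding and two image-embedding spaces, standing in
# for CLIP variants built on different image encoders.
rng = np.random.default_rng(0)
sentence = rng.normal(size=512)
image_db_a = rng.normal(size=(100, 512))   # embeddings from encoder A
image_db_b = rng.normal(size=(100, 512))   # embeddings from encoder B

rankings = [
    np.argsort(cosine_similarity(sentence, image_db_a))[::-1],
    np.argsort(cosine_similarity(sentence, image_db_b))[::-1],
]
best_images = vote(rankings, n_images=100)[:3]
print("suggested illustrations:", best_images)
```

In this toy version, running the script prints the indices of the three images with the highest aggregated vote for the given sentence; in the full method, this step would be repeated for every sentence of the summary.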

Highlights

  • The development of deep neural networks (DNN) has revolutionised issues related to the analysis of images and natural language processing [1]

  • In order to find illustrative images for each SX according to the Text Summarising with Illustrative Pictures (TIPS) method described in Section 2.3, we have prepared an implementation of the proposed solution

  • Based on the results presented in the previous section of our paper, we can conclude that the proposed algorithm suggests illustrative images that are, to some extent, semantically consistent with the sentences of the text summaries to which they relate

Introduction

The development of deep neural networks (DNN) has revolutionised issues related to the analysis of images and natural language processing [1]. The ability to generate feature vectors (embeddings) from both images and texts has greatly facilitated the semantic analysis of these media. A particularly interesting problem is image-text matching, i.e., determining the semantic similarity between an image and a text. Image-text matching is an important multi-modal task with a wide range of applications [2]. Research in this area using deep neural networks is relatively new, and much of the relevant work has been published within the last three years.
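As an illustration of such embedding-based image-text matching, the snippet below scores one image against candidate texts using a publicly available CLIP checkpoint via the Hugging Face transformers library. The model name, image file, and captions are examples only, not the setup used in the paper.

```python
# Illustration only: image-text similarity with a public CLIP checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("candidate_illustration.jpg")  # hypothetical file
texts = ["A dog playing in the park", "A city skyline at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image[i, j]: scaled cosine similarity of image i to text j;
# softmax over texts turns the scores into relative match probabilities.
print(outputs.logits_per_image.softmax(dim=-1))
```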
