SCT: Summary Caption Technique for Retrieving Relevant Images in Alignment with Multimodal Abstractive Summary

Shaik Rafi, Ranjita Das

doi:10.1145/3645029

Abstract

This work proposes an efficient Summary Caption Technique (SCT) that takes a multimodal summary and image captions as input and retrieves the images whose captions are most relevant to the summary. Matching a multimodal summary with an appropriate image is a challenging task in computer vision and natural language processing. Combining these fields is difficult, although the research community has steadily focused on cross-modal retrieval. Open problems include visual question answering, matching queries with images, and matching semantic relationships between the two modalities to retrieve the corresponding image. Related work uses questions to match relationships between visual information and detected objects, matches text with visual information, and employs structural-level representations to align images with text. However, these techniques focus primarily on image-text retrieval or image captioning, and less effort has been spent on retrieving relevant images for a multimodal summary. Hence, our proposed technique extracts and merges features in the Hybrid Image Text layer and embeds the captions semantically with word2vec; the contextual features and semantic relationships are then compared and matched vector by vector across the modalities using cosine similarity. In cross-modal retrieval, we retrieve the top five related images and align to the multimodal summary the relevant images that achieve the highest cosine scores among those retrieved. The model is trained as a seq-to-seq model for 100 epochs, with sparse categorical cross-entropy used to reduce information loss. Further, experiments in cross-modal retrieval on the Multimodal Summarization with Multimodal Output dataset evaluate the quality of image alignment with an image-precision metric, on which the technique demonstrates the best results.
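
The retrieval step described in the abstract (embed the summary and each caption, score them with cosine similarity, keep the top five, and align the best-scoring image) can be illustrated with a minimal sketch. This is not the authors' implementation: the mean pooling of word vectors, the toy three-dimensional embedding table standing in for word2vec, and the names `mean_vector`, `cosine`, and `rank_images` are all assumptions made for illustration, and the Hybrid Image Text layer is not modeled.

```python
import math

def mean_vector(tokens, embeddings):
    """Average the vectors of all tokens found in the embedding table.

    Mean pooling is an assumption of this sketch, not taken from the paper.
    Returns None when no token has an embedding.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_images(summary, captions, embeddings, k=5):
    """Return up to k (image_id, score) pairs, highest cosine score first."""
    summary_vec = mean_vector(summary.lower().split(), embeddings)
    if summary_vec is None:
        return []
    scored = []
    for image_id, caption in captions.items():
        caption_vec = mean_vector(caption.lower().split(), embeddings)
        if caption_vec is not None:
            scored.append((image_id, cosine(summary_vec, caption_vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings standing in for trained word2vec vectors.
TOY_EMBEDDINGS = {
    "storm": [1.0, 0.0, 0.0],
    "flood": [0.9, 0.1, 0.0],
    "city": [0.0, 1.0, 0.0],
    "football": [0.0, 0.0, 1.0],
}

# Hypothetical captions keyed by image identifier.
captions = {
    "img_a": "flood city",
    "img_b": "football",
    "img_c": "storm",
}

top = rank_images("storm city", captions, TOY_EMBEDDINGS, k=5)
best_image = top[0][0]  # image aligned with the summary in this sketch
```

In practice the toy table would be replaced by vectors from a trained word2vec model, and the ranking would run over all captions associated with a document.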

Full Text