Multimodal Summarization Research Articles

This work proposes an efficient Summary Caption Technique (SCT) that takes the multimodal summary and image captions as input and retrieves, via the captions, the corresponding images that are most relevant to the multimodal summary. Matching a multimodal summary with an appropriate image is a challenging task spanning computer vision and natural language processing. Combining these fields is difficult, although the research community has steadily focused on cross-modal retrieval. Related problems include visual question answering, matching queries with images, and matching semantic relationships between two modalities to retrieve the corresponding image. Relevant works use questions to relate visual information to detected objects, match text with visual information, and employ structural-level representations to align images with text. However, these techniques focus primarily on retrieving images for text or on image captioning; less effort has been spent on retrieving relevant images for a multimodal summary. Hence, our proposed technique extracts and merges features in the Hybrid Image Text layer and embeds the captions semantically with word2vec; contextual features and semantic relationships are then compared and matched vector by vector across the modalities using cosine similarity. In cross-modal retrieval, we retrieve the top five related images and align with the multimodal summary those that achieve the highest cosine score among the retrieved images. The model is a sequence-to-sequence model trained for 100 epochs, with sparse categorical cross-entropy used as the loss to reduce information loss. Further, cross-modal retrieval experiments on the multimodal summarization with multimodal output dataset evaluate the quality of image alignment with an image-precision metric, on which the technique demonstrates the best results.
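The retrieval step described in this abstract (embed the summary and each caption, score by cosine similarity, keep the top five) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy two-dimensional vectors stand in for word2vec embeddings, averaging word vectors is an assumed pooling choice, and the names `mean_vector` and `top_k_images` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens (assumed pooling; the abstract does not specify one)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def top_k_images(summary_tokens, captions, embeddings, dim, k=5):
    """Rank images by cosine similarity between the summary and each image's caption."""
    summary_vec = mean_vector(summary_tokens, embeddings, dim)
    scored = [
        (cosine(summary_vec, mean_vector(tokens, embeddings, dim)), image_id)
        for image_id, tokens in captions.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy embeddings standing in for word2vec vectors (hypothetical values).
embeddings = {
    "dog": [1.0, 0.0],
    "park": [0.8, 0.2],
    "stock": [0.0, 1.0],
    "market": [0.1, 0.9],
}
captions = {"img_a": ["dog", "park"], "img_b": ["stock", "market"]}

ranked = top_k_images(["dog"], captions, embeddings, dim=2, k=5)
best_score, best_image = ranked[0]
```

Here `ranked` holds at most five `(score, image_id)` pairs in descending order, and `best_image` is the candidate that would be aligned with the summary.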

In this paper, we focus on analyzing the relationship between the source text and the source image and then, by integrating and generalizing the multi-modal information (e.g., texts and images), producing a multi-modal summary that includes both text and image. However, existing multi-modal summarization methods face several challenges: (1) Different modalities exist in different semantic spaces, so expressing the multi-modal information of the source modalities in a similar semantic representation space is crucial. (2) For text summarization, it is crucial to capture both the differences and the similarities within the source modalities, which reduces redundancy and improves summary quality. To overcome these challenges, we propose Multi-Modal Anchor Adaptation Learning for Multi-Modal Summarization (MA-Sum). Specifically, MA-Sum employs a novel and highly efficient image anchor selection method that selects the object sample containing the richest image information as the image anchor. Simultaneously, it selects the text sentence most closely tied to the image semantics as the language anchor. The multi-modal anchor thus acts as a bridge for multi-modal alignment, narrowing the semantic gap between the textual and visual modalities. Moreover, based on the distance between the anchors and the semantic information in each modality, the positive and negative semantic information of each modality is distinguished. A counterfactual learning mechanism built on the negative semantic information optimizes the multi-modal summarization result. Finally, the multi-modal feature interaction is optimized using an image summary chosen with the multi-modal anchors. According to the experimental results, compared with state-of-the-art multi-modal summarization methods, our proposed MA-Sum improves summarization consistency and completeness and thereby obtains the best multi-modal summarization metrics.
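The idea of separating positive from negative semantic information by distance to an anchor can be illustrated with a small sketch. The abstract does not state the distance measure or the decision rule, so Euclidean distance and a fixed threshold are assumptions here, and `split_by_anchor` is a hypothetical name rather than part of MA-Sum.

```python
import math

def split_by_anchor(anchor, features, threshold):
    """Split named feature vectors into (positive, negative) by distance to the anchor.

    Assumption: items within `threshold` Euclidean distance of the anchor count as
    positive; the abstract does not specify the actual rule.
    """
    positive, negative = [], []
    for name, vec in features.items():
        distance = math.dist(anchor, vec)
        if distance <= threshold:
            positive.append(name)
        else:
            negative.append(name)
    return positive, negative

# Toy sentence features and a language anchor (hypothetical values).
anchor = [1.0, 0.0]
features = {"sentence_1": [0.9, 0.1], "sentence_2": [0.0, 1.0]}

positive, negative = split_by_anchor(anchor, features, threshold=0.5)
```

In this toy case `sentence_1` lies close to the anchor and lands in `positive`, while `sentence_2` lands in `negative`, the group that a counterfactual objective would then draw on.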

Articles published on Multimodal Summarization

Heterogeneous graphormer for extractive multimodal summarization

Abstractive text summarization: State of the art, challenges, and improvements

Align vision-language semantics by multi-task learning for multi-modal summarization

Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

DIUSum: Dynamic Image Utilization for Multimodal Summarization

CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare

Multi-task Hierarchical Heterogeneous Fusion Framework for multimodal summarization

SCT: Summary Caption Technique for Retrieving Relevant Images in Alignment with Multimodal Abstractive Summary

Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism

MMSFT: Multilingual Multimodal Summarization by Fine-Tuning Transformers

Multi-modal anchor adaptation learning for multi-modal summarization

Multimodal text summarization with evaluation approaches

Topic-guided abstractive multimodal summarization with multimodal output

A novel multi-modal neural network approach for dynamic and generic sports video summarization

A Survey on Multi-modal Summarization

Learning a holistic and comprehensive code representation for code summarization

Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization

Abstractive Summarization for Video: A Revisit in Multistage Fusion Network With Forget Gate

Extractive text-image summarization with relation-enhanced graph attention network

On Multimodal Microblog Summarization
