Abstract

Multimodal summarization has drawn much attention due to the rapid growth of multimedia data. The output of current multimodal summarization systems is usually represented as text. However, we have found through experiments that multimodal output can significantly improve user satisfaction with the informativeness of summaries. In this paper, we propose a novel task, multimodal summarization with multimodal output (MSMO). To handle this task, we first collect a large-scale dataset for MSMO research. We then propose a multimodal attention model to jointly generate text and select the most relevant image from the multimodal input. Finally, to evaluate multimodal outputs, we construct a novel multimodal automatic evaluation (MMAE) method which considers both intra-modality salience and inter-modality relevance. The experimental results show the effectiveness of MMAE.
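As a rough illustration of the kind of score such a metric produces, the sketch below linearly combines the three quantities the abstract names. The function name mmae_score and the fixed weights are hypothetical placeholders; how the sub-metrics are actually combined and validated against human judgments is described in the paper.

```python
# Illustrative sketch only: mmae_score and the fixed weights are our
# assumptions; the actual MMAE method combines its sub-metrics in a way
# validated against human judgments.

def mmae_score(text_salience: float,
               image_salience: float,
               image_text_relevance: float,
               weights=(0.4, 0.3, 0.3)) -> float:
    """Combine intra-modality salience (text and image) with
    inter-modality relevance into a single quality score."""
    w_text, w_image, w_rel = weights
    return (w_text * text_salience
            + w_image * image_salience
            + w_rel * image_text_relevance)

# Example: a strong text summary paired with a weakly relevant image.
print(mmae_score(0.62, 0.55, 0.30))  # 0.503
```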

Highlights

  • Text summarization aims to extract the important information from source documents

  • With a text-only summary, the user may be confused by the description of “four-legged creatures”; with a relevant image, the user gains a clearer understanding of the text

  • We conduct the following five sets of experiments: 1) To verify our motivation for multimodal output, we design a user satisfaction test (Sec. 4.2); 2) We compare our multimodal summarization with text summarization using both ROUGE scores and manual evaluation (Sec. 4.3); 3) To verify the effectiveness of our evaluation metrics, we calculate the correlation between these metrics and human judgments (Sec. 4.4); 4) We conduct two experiments to show the effectiveness and the generalization of our proposed multimodal automatic evaluation (MMAE), respectively (Sec. 4.5); 5) We evaluate our multimodal attention model with MMAE (Sec. 4.6)


Summary

Introduction

Text summarization aims to extract the important information from source documents. With the increase of multimedia data on the internet, researchers (Li et al., 2016b; Shah et al., 2016; Li et al., 2017) have focused on multimodal summarization in recent years. Building on this, we propose a novel task which we refer to as Multimodal Summarization with Multimodal Output (MSMO). To explore this task, we consider an output that consists of an image (for simplicity, we first consider only one image) and a piece of text. We propose a multimodal attention model to jointly generate text and select the most relevant image, in which the importance of images is determined by a visual coverage vector. To evaluate multimodal outputs, we construct a novel multimodal automatic evaluation (MMAE) method which mainly considers three aspects: salience of text, salience of image, and relevance between text and image
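To make the joint attention and the visual coverage vector concrete, below is a minimal sketch of one decoding step, assuming precomputed text and image feature matrices. The function, tensor shapes, and the simple additive fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one decoding step with attention over both text and
# image features, assuming precomputed feature matrices. Tensor names,
# shapes, and the additive fusion are illustrative assumptions only.

def multimodal_attention_step(dec_state,     # (hidden,) decoder state
                              text_feats,    # (n_tokens, hidden)
                              img_feats,     # (n_images, hidden)
                              vis_coverage): # (n_images,) running sum
    # Attention over source tokens gives a textual context vector.
    txt_attn = F.softmax(text_feats @ dec_state, dim=0)
    txt_ctx = txt_attn @ text_feats

    # Attention over images; accumulating the attention weights yields
    # a visual coverage vector that tracks each image's importance.
    img_attn = F.softmax(img_feats @ dec_state, dim=0)
    img_ctx = img_attn @ img_feats
    vis_coverage = vis_coverage + img_attn

    # The fused context would feed the text decoder at this step.
    return txt_ctx + img_ctx, vis_coverage

# After decoding, the image with the highest accumulated coverage
# would be selected as the output image:
# best_image_idx = torch.argmax(vis_coverage)
```

Accumulating the image attention weights across all decoding steps gives a natural ranking of image importance, which is why the coverage vector can double as the image selector.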

Overview
Pointer-Generator Network
Multimodal Attention Model
Multimodal Automatic Evaluation
Salience of Text
Salience of Image
Image-Text Relevance
Experiments
Dataset
User Satisfaction Test
Comparison with Text Summarization
Correlation Test
Effectiveness and Generalization of MMAE
Model Performances
Related Work
Findings
Conclusion