Abstract

Multimodal summarization has drawn much attention due to the rapid growth of multimedia data. The output of current multimodal summarization systems is usually represented as text. However, we have found through experiments that multimodal output can significantly improve user satisfaction with the informativeness of summaries. In this paper, we propose a novel task, multimodal summarization with multimodal output (MSMO). To handle this task, we first collect a large-scale dataset for MSMO research. We then propose a multimodal attention model to jointly generate text and select the most relevant image from the multimodal input. Finally, to evaluate multimodal outputs, we construct a novel multimodal automatic evaluation (MMAE) method which considers both intra-modality salience and inter-modality relevance. The experimental results show the effectiveness of MMAE.
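As a rough illustration of the kind of score such a metric produces, the sketch below linearly combines the three quantities the abstract names. The function name mmae_score and the fixed weights are hypothetical placeholders; how the sub-metrics are actually combined and validated against human judgments is described in the paper.

```python
# Illustrative sketch only: mmae_score and the fixed weights are our
# assumptions; the actual MMAE method combines its sub-metrics in a way
# validated against human judgments.

def mmae_score(text_salience: float,
               image_salience: float,
               image_text_relevance: float,
               weights=(0.4, 0.3, 0.3)) -> float:
    """Combine intra-modality salience (text and image) with
    inter-modality relevance into a single quality score."""
    w_text, w_image, w_rel = weights
    return (w_text * text_salience
            + w_image * image_salience
            + w_rel * image_text_relevance)

# Example: a strong text summary paired with a weakly relevant image.
print(mmae_score(0.62, 0.55, 0.30))  # 0.503
```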

Highlights

  • Text summarization aims to extract the important information from source documents

  • With a text-only summary, the user may be confused by the description of “four-legged creatures”; with a relevant image, the user gains a clearer understanding of the text

  • We conduct the following five sets of experiments: 1) To verify our motivation for multimodal output, we design a user satisfaction test (Sec. 4.2); 2) We compare our multimodal summarization with text summarization using both ROUGE scores and manual evaluation (Sec. 4.3); 3) To verify the effectiveness of our evaluation metrics, we calculate the correlation between these metrics and human judgments (Sec. 4.4); 4) We conduct two experiments to show the effectiveness and the generalization of our proposed multimodal automatic evaluation (MMAE), respectively (Sec. 4.5); 5) We evaluate our multimodal attention model with MMAE (Sec. 4.6)


Summary

Introduction

Text summarization aims to extract the important information from source documents. With the increase of multimedia data on the internet, researchers (Li et al., 2016b; Shah et al., 2016; Li et al., 2017) have focused on multimodal summarization in recent years. Building on this, we propose a novel task which we refer to as Multimodal Summarization with Multimodal Output (MSMO). To explore this task, we consider an output that consists of an image (for simplicity, we first consider only one image) and a piece of text. We propose a multimodal attention model to jointly generate text and select the most relevant image, in which the importance of images is determined by a visual coverage vector. To evaluate multimodal outputs, we construct a novel multimodal automatic evaluation (MMAE) method which mainly considers three aspects: salience of text, salience of image, and relevance between text and image
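To make the joint attention and the visual coverage vector concrete, below is a minimal sketch of one decoding step, assuming precomputed text and image feature matrices. The function, tensor shapes, and the simple additive fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one decoding step with attention over both text and
# image features, assuming precomputed feature matrices. Tensor names,
# shapes, and the additive fusion are illustrative assumptions only.

def multimodal_attention_step(dec_state,     # (hidden,) decoder state
                              text_feats,    # (n_tokens, hidden)
                              img_feats,     # (n_images, hidden)
                              vis_coverage): # (n_images,) running sum
    # Attention over source tokens gives a textual context vector.
    txt_attn = F.softmax(text_feats @ dec_state, dim=0)
    txt_ctx = txt_attn @ text_feats

    # Attention over images; accumulating the attention weights yields
    # a visual coverage vector that tracks each image's importance.
    img_attn = F.softmax(img_feats @ dec_state, dim=0)
    img_ctx = img_attn @ img_feats
    vis_coverage = vis_coverage + img_attn

    # The fused context would feed the text decoder at this step.
    return txt_ctx + img_ctx, vis_coverage

# After decoding, the image with the highest accumulated coverage
# would be selected as the output image:
# best_image_idx = torch.argmax(vis_coverage)
```

Accumulating the image attention weights across all decoding steps gives a natural ranking of image importance, which is why the coverage vector can double as the image selector.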

Overview
Pointer-Generator Network
Multimodal Attention Model
Multimodal Automatic Evaluation
Salience of Text
Salience of Image
Image-Text Relevance
Experiments
Dataset
User Satisfaction Test
Comparison with Text Summarization
Correlation Test
Effectiveness and Generalization of MMAE
Model Performances
Related Work
Findings
Conclusion