Multimodal Summarization Research Articles

Multi-modal abstractive summarization for videos is an emerging task, aiming to integrate multi-modal and multi-source inputs (video, audio transcript) into a compressed textual summary. Although recent multi-encoder-decoder models on this task have shown promising performance, they did not explicitly model interactions of multi-source inputs. While some strategies like co-attention are utilized for modeling this interaction, considering ultra-long sequences and additional decoder in this task, the coupling of multi-modal data from multi-encoders and decoder needs complicated structure and additional parameters. In this paper, we propose a concise Decoder-only Multi-modal Transformer (D-MmT) based on the above observations. Specifically, we cut the encoder structure, and introduce an in-out shared multi-modal decoder to make the multi-source and target fully interact and couple in the shared feature space, reducing the model parameter redundancy. Also, we design a concise cascaded cross-modal interaction (CXMI) module in the multi-modal decoder that generates joint fusion representations and spontaneously establishes a fine-grained intra- and inter- association between multi-modalities. In addition, to make full use of the ultra-long sequence information, we introduce a joint in-out loss to make the input transcript also participate in backpropagation to enhance the contextual feature representation. The experimental results on the How2 dataset show that the proposed model outperforms the current state-of-the-art approach with fewer model parameters. Further analysis and visualization show the effectiveness of our proposed framework.

Cuisine is a style of cooking and usually associated with a specific geographic region. Recipes from different cuisines shared on the web are an indicator of culinary cultures in different countries. Therefore, analysis of these recipes can lead to deep understanding of food from the cultural perspective. In this paper, we perform the first cross-region recipe analysis by jointly using the recipe ingredients, food images, and attributes such as the cuisine and course (e.g., main dish and dessert). For that solution, we propose a culinary culture analysis framework to discover the topics of ingredient bases and visualize them to enable various applications. We first propose a probabilistic topic model to discover cuisine-course specific topics. The manifold ranking method is then utilized to incorporate deep visual features to retrieve food images for topic visualization. At last, we applied the topic modeling and visualization method for three applications: 1) multimodal cuisine summarization with both recipe ingredients and images, 2) cuisine-course pattern analysis including topic-specific cuisine distribution and cuisine-specific course distribution of topics, and 3) cuisine recommendation for both cuisine-oriented and ingredient-oriented queries. Through these three applications, we can analyze the culinary cultures at both macro and micro levels. We conduct the experiment on a recipe database Yummly-66K with 66,615 recipes from 10 cuisines in Yummly. Qualitative and quantitative evaluation results have validated the effectiveness of topic modeling and visualization, and demonstrated the advantage of the framework in utilizing rich recipe information to analyze and interpret the culinary cultures from different regions.

Multimodal Summarization Research Articles

Articles published on Multimodal Summarization

Research on Multimodal Summarization Based on Sequence to Sequence Model

UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation

Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization

Inter- and Intra-Modal Contrastive Hybrid Learning Framework for Multimodal Abstractive Summarization.

Machine learning based fusion algorithm to perform multimodal summarization

트랜스포머기반의 멀티모달 영상자막 생성요약

Multi-Tier Attention Network using Term-weighted Question Features for Visual Question Answering

Multimodal Summarization of User-Generated Videos

Graph-based Multimodal Ranking Models for Multimodal Summarization

LAMS: A Location-aware Approach for Multimodal Summarization (Student Abstract)

D-MmT: A concise decoder-only multi-modal transformer for abstractive summarization in videos

MBA: A Multimodal Bilinear Attention Model with Residual Connection for Abstractive Multimodal Summarization

Multimodal Summarization with Guidance of Multimodal Reference

Aspect-Aware Multimodal Summarization for Chinese E-Commerce Products

Multi-Modal Deep Analysis for Multimedia

Text summarization using multiobjective optimization

Read, Watch, Listen, and Summarize: Multi-Modal Summarization for Asynchronous Text, Image, Audio and Video

Extractive summarization of documents with images based on multi-modal RNN

You Are What You Eat: Exploring Rich Recipe Information for Cross-Region Food Analysis

Multimedia Pivot Tables for Multimedia Analytics on Image Collections

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Multimodal Summarization Research Articles

Articles published on Multimodal Summarization

Research on Multimodal Summarization Based on Sequence to Sequence Model

UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation

Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization

Inter- and Intra-Modal Contrastive Hybrid Learning Framework for Multimodal Abstractive Summarization.

Machine learning based fusion algorithm to perform multimodal summarization

트랜스포머기반의 멀티모달 영상자막 생성요약

Multi-Tier Attention Network using Term-weighted Question Features for Visual Question Answering

Multimodal Summarization of User-Generated Videos

Graph-based Multimodal Ranking Models for Multimodal Summarization

LAMS: A Location-aware Approach for Multimodal Summarization (Student Abstract)

D-MmT: A concise decoder-only multi-modal transformer for abstractive summarization in videos

MBA: A Multimodal Bilinear Attention Model with Residual Connection for Abstractive Multimodal Summarization

Multimodal Summarization with Guidance of Multimodal Reference

Aspect-Aware Multimodal Summarization for Chinese E-Commerce Products

Multi-Modal Deep Analysis for Multimedia

Text summarization using multiobjective optimization

Read, Watch, Listen, and Summarize: Multi-Modal Summarization for Asynchronous Text, Image, Audio and Video

Extractive summarization of documents with images based on multi-modal RNN

You Are What You Eat: Exploring Rich Recipe Information for Cross-Region Food Analysis

Multimedia Pivot Tables for Multimedia Analytics on Image Collections