Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Vision-language pre-training has been an emerging and fast-developing research topic that transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representation of an image can span different granularities, from simple to comprehensive: an individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.
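Below is a minimal PyTorch sketch of how the four multi-granular proxy tasks could be combined into a single pre-training objective. The head names, tensor shapes, and equal loss weighting are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class UniEdenPretrainLoss(nn.Module):
    """Hypothetical combination of Uni-EDEN's four proxy-task losses."""
    def __init__(self, hidden=768, num_obj_classes=1600, vocab=30522):
        super().__init__()
        self.obj_cls_head = nn.Linear(hidden, num_obj_classes)  # Masked Object Classification
        self.phrase_head = nn.Linear(hidden, vocab)             # Masked Region Phrase Generation
        self.match_head = nn.Linear(hidden, 2)                  # Image-Sentence Matching
        self.lm_head = nn.Linear(hidden, vocab)                 # Masked Sentence Generation
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)        # -100 marks unmasked positions

    def forward(self, obj_feats, dec_feats, pooled, obj_labels,
                phrase_labels, match_labels, sent_labels):
        # Each task supervises a different granularity of the same image:
        # labels (objects), phrases (regions), and full sentences.
        l_moc = self.ce(self.obj_cls_head(obj_feats).flatten(0, 1), obj_labels.flatten())
        l_mrpg = self.ce(self.phrase_head(dec_feats).flatten(0, 1), phrase_labels.flatten())
        l_ism = self.ce(self.match_head(pooled), match_labels)
        l_msg = self.ce(self.lm_head(dec_feats).flatten(0, 1), sent_labels.flatten())
        return l_moc + l_mrpg + l_ism + l_msg
```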
- Research Article
3
- 10.1109/tpami.2025.3582000
- Jan 1, 2025
- IEEE transactions on pattern analysis and machine intelligence
Recent advancements in language models have demonstrated their capacity for context understanding and generative representation. Leveraging these developments, we propose a novel multimodal trajectory predictor based on a vision-language model, named VLMTraj, which takes full advantage of the prior knowledge of multimodal large language models and their human-like reasoning across diverse modalities. The key idea of our model is to reframe the trajectory prediction task as visual question answering, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, we transform all inputs into a natural-language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. Additionally, visual features from input images are transformed into tokens via a modality encoder and connector. The transformed data is then formatted for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer on the prompt data to effectively separate integer and decimal parts, allowing us to capture correlations between consecutive numbers in the language model. We then train our language model on all the visual question answering prompts. During inference, we implement both deterministic and stochastic prediction through beam-search-based most-likely prediction and temperature-based multimodal generation. VLMTraj validates that a language-based model can be a powerful pedestrian trajectory predictor, and it outperforms existing numerical-based predictors. Extensive experiments show that VLMTraj successfully understands social relationships and accurately extrapolates multimodal futures on public pedestrian trajectory prediction benchmarks.
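Below is a minimal sketch of the prompt-conversion idea: a numeric trajectory history plus an automatic scene caption becomes a conversational question for the language model. The prompt wording and the explicit integer/decimal split (mirroring the goal of the paper's learned numerical tokenizer) are assumptions for illustration.

```python
def trajectory_to_prompt(history, caption):
    """history: list of (x, y) floats; caption: auto-generated scene description."""
    def fmt(v):
        # Split integer and decimal parts explicitly, in the spirit of the
        # paper's numerical tokenizer (the real tokenizer is learned).
        whole, frac = f"{v:.2f}".split(".")
        return f"{whole} . {frac}"
    steps = "; ".join(f"({fmt(x)}, {fmt(y)})" for x, y in history)
    return (f"Scene: {caption}\n"
            f"Past pedestrian positions: {steps}.\n"
            f"Question: where will the pedestrian move over the next 12 steps?")

print(trajectory_to_prompt([(1.0, 2.53), (1.21, 2.87)], "a crowded crosswalk"))
```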
- Research Article
5
- 10.2196/56627
- Aug 5, 2024
- JMIR Medical Informatics
Background: Medical image analysis, particularly in the context of visual question answering (VQA) and image captioning, is crucial for accurate diagnosis and educational purposes.
Objective: Our study aims to introduce BioMedBLIP models, fine-tuned for VQA tasks using specialized medical data sets such as Radiology Objects in Context and Medical Information Mart for Intensive Care-Chest X-ray, and to evaluate their performance in comparison to the state-of-the-art (SOTA) original Bootstrapping Language-Image Pretraining (BLIP) model.
Methods: We present 9 versions of BioMedBLIP across 3 downstream tasks on various data sets, trained for varying numbers of epochs. We propose BioMedBLIP VQA generation, VQA classification, and image captioning models. We conducted pretraining of BLIP on medical data sets, producing an adapted BLIP model tailored for medical applications.
Results: Our models show strong overall performance. In VQA generation tasks, BioMedBLIP models outperformed the SOTA on the Semantically-Labeled Knowledge-Enhanced (SLAKE), VQA in Radiology (VQA-RAD), and Image Cross-Language Evaluation Forum data sets. In VQA classification, our models consistently surpassed the SOTA on the SLAKE data set and showed competitive performance on the VQA-RAD and PathVQA data sets. Similarly, in image captioning tasks, our model beat the SOTA, suggesting the importance of pretraining with medical data sets. Overall, across 20 data set and task combinations, BioMedBLIP represents a new SOTA in 15 (75%) of 20 tasks, and our responses were rated higher in all 20 tasks (P<.005) in comparison to SOTA models.
Conclusions: Our BioMedBLIP models show promising performance and suggest that incorporating medical knowledge through pretraining with domain-specific medical data sets helps models achieve higher performance. Our models thus demonstrate their potential to advance medical image analysis, impacting diagnosis, medical education, and research. However, data quality, task-specific variability, computational resources, and ethical considerations should be carefully addressed. Our models represent a contribution toward the synergy of artificial intelligence and medicine. We have made BioMedBLIP freely available, which will help further advance research in multimodal medical tasks.
- Research Article
61
- 10.1016/j.eswa.2022.118669
- Aug 28, 2022
- Expert Systems with Applications
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to encode world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task that requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) state-of-the-art results when increasing the size of the language model. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis of OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities for further improving inference in visio-linguistic tasks.
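The pipeline is simple enough to sketch end to end: caption the image with an off-the-shelf model, then let a text-only language model answer from the caption. The checkpoints below are placeholders rather than the authors' exact models, assuming the Hugging Face transformers library.

```python
from transformers import pipeline

# Placeholder checkpoints; the paper's captioner and language models may differ.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
reader = pipeline("text-generation", model="gpt2")

def text_only_vqa(image_path, question):
    caption = captioner(image_path)[0]["generated_text"]  # verbalize the image
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    out = reader(prompt, max_new_tokens=10)[0]["generated_text"]
    return out[len(prompt):].strip()  # keep only the newly generated answer
```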
- Research Article
- 10.54254/2755-2721/2026.as31461
- Jan 26, 2026
- Applied and Computational Engineering
Multimodal understanding, which requires models to jointly reason over visual and linguistic information, has become a core challenge in artificial intelligence (AI). Visual Question Answering (VQA) stands as a paradigmatic task for investigating these multimodal reasoning capabilities. While early VQA systems relied on task-specific architectures, recent breakthroughs in Multimodal Large Language Models (MLLMs) have significantly reshaped the field by proposing unified, instruction-driven multimodal reasoning frameworks. By conducting a systematic literature review, this paper scrutinizes the evolution of VQA from traditional CNN-LSTM-based models to modern MLLM-based approaches. The review centers on representative architectures and training paradigms, including BLIP-2 and LLaVA, to analyze how large language models and pretrained vision encoders are integrated for flexible and open-ended visual reasoning. In addition, this paper identifies and deliberates on critical challenges confronting contemporary MLLMs, encompassing modality imbalance, insufficient cross-modal alignment, and hallucinations. This paper concludes that while MLLMs have substantially expanded the application scope and functional capabilities of VQA systems, they still grapple with reliable visual grounding and balanced multimodal fusion. Addressing these limitations is paramount for constructing trustworthy and robust VQA systems, and future research should prioritize improving alignment mechanisms and mitigating hallucinations in multimodal reasoning.
- Preprint Article
- 10.21203/rs.3.rs-5533456/v1
- Nov 29, 2024
Recent advancements in vision-language models have achieved remarkable results in making language models understand visual inputs. However, a unified approach to aligning these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods require either very large language models or very large datasets, which makes inefficient use of existing models. This paper addresses this gap and devises a training strategy for auto-regressive vision-language models to unify vision-language tasks such as image captioning and visual question answering. We propose four training stages for aligning the vision model with the language model; in other words, the language model is given the ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we report several findings: 1) the attention mask should not be applied to visual inputs; 2) the language model converges faster on AI-generated data; 3) more work should be done in the alignment stage during pre-training; 4) the model can easily adapt to downstream tasks such as visual question answering on healthcare datasets like PathVQA. After training for one epoch across all stages, the model outperforms large models such as the 13-billion-parameter VILA on common benchmarks such as CIDEr scores on the COCO and Flickr30k datasets, and it achieves scores very close to GIT-2 on the same datasets despite being a much smaller model trained on a much smaller dataset. All training follows available best practices, including multi-GPU parallel training, lower-precision training with 16-bit floats, faster attention (SDPA), and gradient accumulation, and it completes within 12 hours.
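A minimal sketch of the hybrid attention mask implied by finding 1: visual tokens attend bidirectionally (no causal mask), text tokens remain causal, and text can always attend to the image. The additive-mask convention and visual-first token ordering are assumptions.

```python
import torch

def hybrid_mask(n_vis, n_txt):
    n = n_vis + n_txt
    mask = torch.full((n, n), float("-inf"))
    mask[:n_vis, :n_vis] = 0.0  # visual tokens: full bidirectional attention
    causal = torch.triu(torch.ones(n_txt, n_txt, dtype=torch.bool), diagonal=1)
    mask[n_vis:, n_vis:] = torch.zeros(n_txt, n_txt).masked_fill(causal, float("-inf"))
    mask[n_vis:, :n_vis] = 0.0  # text tokens can always see the image
    return mask  # added to attention logits before the softmax

print(hybrid_mask(2, 3))
```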
- Book Chapter
- 10.3233/faia250227
- Mar 17, 2025
- Frontiers in artificial intelligence and applications
This chapter explores the advancements and challenges in achieving comprehensive scene understanding and visual reasoning through neurosymbolic integration and Multimodal Large Language Models (MLLMs). It begins by highlighting the limitations of basic vision tasks in extracting contextual and relational information from scenes, introducing scene graphs as a structured representation to bridge this gap. The chapter delves into Scene Graph Generation (SGG) methods, emphasising the importance of incorporating common sense knowledge from knowledge graphs to enhance the accuracy and expressiveness of scene graphs. The NeuSyRE framework is presented as a neurosymbolic approach for enriched scene graph generation and reasoning, demonstrating its effectiveness in downstream tasks such as image captioning and visual question answering. The chapter also examines the role of MLLMs in visual reasoning, discussing their architectures, performance on zero-shot tasks and challenges in handling fine-grained visual details. MARVEL, a novel benchmark for abstract visual reasoning, is introduced to evaluate the perceptual and reasoning capabilities of MLLMs. Insights from MARVEL highlight the limitations of current MLLMs in solving complex reasoning tasks and underscore the potential of neurosymbolic systems to address these challenges. The chapter concludes by emphasising the synergy between neurosymbolic approaches and MLLMs in advancing visual intelligence and achieving robust and explainable AI systems.
- Research Article
10
- 10.1609/aaai.v37i11.26569
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Visual Question Answering (VQA) aims to answer a natural language question about a given image by understanding multimodal content. However, the answer quality of most existing visual-language pre-training (VLP) methods is still limited, mainly due to: (1) Incompatibility. Upstream pre-training tasks are generally incompatible with downstream question answering tasks, so knowledge from the language model does not transfer well to downstream tasks, greatly limiting performance in few-shot scenarios. (2) Under-fitting. They generally do not integrate human priors to complement the universal knowledge from language models, so as to fit the challenging VQA problem and generate reliable answers. To address these issues, we propose HybridPrompt, a cloze- and verify-style hybrid prompt framework that bridges language models and human priors in prompt tuning for VQA. Specifically, we first rewrite the input questions as cloze-style prompts to narrow the gap between upstream pre-training tasks and the downstream VQA task, ensuring that the universal knowledge in the language model transfers well to the subsequent human-prior-guided prompt tuning. Then, imitating the cognitive process of the human brain, we introduce topic- and sample-related priors to construct a dynamic learnable prompt template for human-prior-guided prompt learning. Finally, we add fixed-length learnable free parameters to further enhance the generalizability and scalability of prompt learning in the VQA model. Experimental results verify the effectiveness of HybridPrompt: it achieves competitive performance against previous methods on the widely used VQAv2 dataset and obtains new state-of-the-art results. Our code is released at: https://github.com/zhizhi111/hybrid.
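A minimal sketch of the cloze-style rewriting step, assuming a small set of pattern rules; the paper's actual templates and its learnable prompt components are not reproduced here.

```python
def to_cloze(question):
    """Rewrite a VQA question into a masked-LM-style cloze prompt."""
    q = question.rstrip("?").strip()
    rules = [
        ("what color is", "the color of{rest} is [MASK]"),
        ("what is", "{rest} is [MASK]"),
    ]
    for prefix, template in rules:
        if q.lower().startswith(prefix):
            return template.format(rest=" " + q[len(prefix):].strip())
    return f"{q}? [MASK]"  # fallback: append a mask slot

print(to_cloze("What color is the umbrella?"))
# -> "the color of the umbrella is [MASK]"
```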
- Research Article
5
- 10.1609/aaai.v39i9.33073
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
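A minimal sketch of a lopsided language-model loss in the spirit of L-LM: per-token cross-entropy is split between the short judgment span and the longer interpretation span and weighted separately so the judgment dominates. The weights and token bookkeeping are assumptions.

```python
import torch
import torch.nn.functional as F

def lopsided_lm_loss(logits, labels, is_judgment, w_judge=1.0, w_interp=0.1):
    """logits: (T, V); labels: (T,); is_judgment: (T,) bool mask over tokens."""
    per_token = F.cross_entropy(logits, labels, reduction="none")
    loss_judge = per_token[is_judgment].mean()    # e.g., the "real"/"spoof" tokens
    loss_interp = per_token[~is_judgment].mean()  # the free-text interpretation
    return w_judge * loss_judge + w_interp * loss_interp
```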
- Research Article
67
- 10.1016/j.inffus.2021.02.006
- Feb 12, 2021
- Information Fusion
DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and Explanation Generation
- Research Article
- Mar 17, 2025
- ArXiv
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing that MicroVQA is a valuable resource for advancing AI-driven biomedical research. MicroVQA is available here, project here.
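The two-stage construction can be sketched as a loop: structure the raw QA pair into an MCQ, then keep rewriting it while a text-only model can still answer correctly without the image. The three callables below are hypothetical stand-ins for the paper's prompts and RefineBot agent, passed in as parameters so the sketch stays self-contained.

```python
def build_mcq(qa_pair, image, llm_make_mcq, text_only_answer, llm_refine, max_rounds=3):
    """llm_make_mcq, text_only_answer, llm_refine are hypothetical callables."""
    mcq = llm_make_mcq(qa_pair)                      # stage 1: structure QA into an MCQ
    for _ in range(max_rounds):                      # stage 2: RefineBot-style loop
        if text_only_answer(mcq) != mcq["answer"]:   # blind model fails: no shortcut left
            break
        mcq = llm_refine(mcq, image)                 # rewrite stem/distractors and retry
    return mcq
```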
- Research Article
26
- 10.1145/3711680
- Mar 5, 2025
- ACM Computing Surveys
Visual Question Answering (VQA) is a challenging task that combines natural language processing and computer vision techniques and has gradually become a benchmark task for multimodal large language models (MLLMs). The goal of our survey is to provide an overview of the development of VQA and an up-to-date, detailed description of the latest models. This survey gives a current synthesis of natural language understanding of images and text, as well as of the knowledge reasoning module based on image-question information, for the core VQA tasks. In addition, we elaborate on recent advances in extracting and fusing modal information with vision-language pre-training models and multimodal large language models in VQA. We also exhaustively review the progress of knowledge reasoning in VQA, detailing the extraction of internal knowledge and the introduction of external knowledge. Finally, we present VQA datasets and different evaluation metrics and discuss possible directions for future work.
- Research Article
29
- 10.1016/j.media.2024.103279
- Jul 20, 2024
- Medical Image Analysis
Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
- Research Article
19
- 10.1145/3573891
- Mar 24, 2023
- ACM Transactions on Asian and Low-Resource Language Information Processing
In sequence-to-sequence modeling tasks such as image captioning, machine translation, and visual question answering, encoder-decoder architectures are the state of the art. In image captioning, an encoder, a convolutional neural network (CNN), encodes input images into fixed-dimensional vector representations, whereas a decoder, a recurrent neural network, performs language modeling and generates the target descriptions. Recent CNNs apply the same operation to every pixel; however, not all image pixels are equally important. To address this, the proposed method uses a dynamic-convolution-based encoder for image encoding or feature extraction, a Long Short-Term Memory (LSTM) network as a decoder for language modeling, and X-Linear attention to make the system robust. Encoders, attention mechanisms, and decoders are important aspects of the image captioning task; therefore, we experiment with various encoders, decoders, and attention mechanisms. In the existing literature, most image captioning work has been carried out for the English language. We propose a novel approach for caption generation from images in Hindi. Hindi, widely spoken in South Asia and India, is the fourth most-spoken language globally and is India's official language. The proposed method uses dynamic convolution on the encoder side to obtain better image encodings. The Hindi image captioning dataset is manually created by translating the popular MSCOCO dataset from English to Hindi. In terms of BLEU scores, the performance of the proposed method is compared with other baselines, and the results show that the proposed method outperforms them. Manual human assessment of the adequacy and fluency of the generated captions further confirms the efficacy of the proposed method in generating good-quality captions.
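A minimal encoder-decoder captioning skeleton of the kind the paper builds on: a CNN encodes the image and an LSTM decodes the caption. The plain ResNet-18 encoder stands in for the paper's dynamic-convolution encoder, and X-Linear attention is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Captioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = resnet18(weights=None)  # stand-in for the dynamic-convolution encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier
        self.img_proj = nn.Linear(512, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, tokens):
        feat = self.encoder(images).flatten(1)            # (B, 512) image features
        h0 = self.img_proj(feat).unsqueeze(0)             # init LSTM state from the image
        c0 = torch.zeros_like(h0)
        hid, _ = self.lstm(self.embed(tokens), (h0, c0))  # teacher-forced decoding
        return self.out(hid)                              # per-step vocabulary logits
```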
- Research Article
- 10.3389/fcomp.2025.1626346
- Sep 1, 2025
- Frontiers in Computer Science
Multimodal large language models have become the mainstream approach in natural language processing and have been applied to various cross-modal fields such as image description and visual question answering. However, large language models have high computational complexity and a large operational scale, which presents significant challenges for deployment in resource-constrained scenarios. To address these problems, a lightweight multimodal framework, LLaVA-GM, is proposed based on LLaVA; it can be deployed on devices with low resource requirements, greatly reduces model parameters, and achieves good performance on common VQA tasks. The main contributions are as follows. First, the Vicuna language-model backbone in LLaVA is found to be overly redundant: when fine-tuning on downstream tasks, very small datasets can hardly affect such a large language model. It is replaced with the newer Gemma language model, enabling fast task-specific adaptation with fewer parameters and less data. Second, to address information redundancy, a Mixture-of-Experts (MoE) model is introduced and combined with Gemma to reduce computation while maintaining performance. Since directly training the entire model leads to a decline in performance, a multi-stage training strategy is adopted: first, the MLP layer is trained for visual adaptation; then the entire Gemma model is trained to improve multimodal capabilities; and finally, only the MoE layers are trained for sparsification, ensuring a smooth transition from a dense model to a sparse one. Experiments on multiple VQA datasets achieved good performance, confirming the potential of this compact model in downstream multimodal applications.
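A minimal sketch of the MoE idea used for sparsification: each token is routed to one small feed-forward expert, so only a fraction of the parameters is active per token. The top-1 routing and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                  # x: (num_tokens, d_model)
        scores = self.gate(x).softmax(-1)  # routing probabilities
        top = scores.argmax(-1)            # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top == i
            if sel.any():
                # scale by the gate score so routing stays differentiable
                out[sel] = expert(x[sel]) * scores[sel, i:i + 1]
        return out
```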
- Conference Article
42
- 10.18653/v1/2022.findings-acl.196
- Jan 1, 2022
Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual, and -- vice versa -- multilingual models to become multimodal. Our proposed methods outperform current state-of-the-art multilingual multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the accuracy remains low across the board; a performance drop of around 38 accuracy points in target languages showcases the difficulty of zero-shot cross-lingual transfer for this task. Our results suggest that simple cross-lingual transfer of multimodal models yields latent multilingual multimodal misalignment, calling for more sophisticated methods for vision and multilingual language modeling. The xGQA dataset is available online at: this https URL.
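A minimal sketch of the bottleneck-adapter building block that such approaches insert into a frozen transformer to add a language or a modality; the dimensions and placement are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model=768, bottleneck=96):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden):
        # The residual keeps the frozen backbone's behavior recoverable;
        # only the adapter parameters are trained.
        return hidden + self.up(self.act(self.down(hidden)))
```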