Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Muhammad Arslan Manzoor, Ziting Xian, Zaiqiao Meng, Sarah Albarri, Shangsong Liang, Preslav Nakov

doi:10.1145/3617833

Abstract

Multimodality Representation Learning, the technique of learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures have performed extraordinarily well on multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that deal with textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) state-of-the-art pretrained multimodal approaches through to unifying architectures, and (iv) multimodal task categories and possible future improvements for better multimodal learning. Moreover, we provide a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.

Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Publication Date: Oct 23, 2023
Citations: 3

Similar Papers

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
Jack Hessel ... Lillian Lee
01 Jan 2020

New Ideas and Trends in Deep Multimodal Content Understanding: A Review
Wei Chen ... Michael S Lew
Neurocomputing | VOL. 426
23 Oct 2020

Second Order Enhanced Multi-glimpse Attention in Visual Question Answering
Qiang Sun ... Binghui Xie
01 Jan 2020

T5 - Representation
05 Jul 2022
