Unifying Vision-Language Representation Space with Single-Tower Transformer

Jiho Jang, Nojun Kwak, Seonhoon Kim, Chaerin Kong, Donghyeon Jeon

doi:10.1609/aaai.v37i1.25178

Abstract

Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and its caption can be regarded as two different views of the same underlying mutual information, and we train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a one-tower model for vision-language pretraining (VLP), and then propose One Representation (OneR) as a simple yet effective framework for this goal. We discover intriguing properties, such as zero-shot localization, text-guided visual reasoning, and multi-modal retrieval, that distinguish OneR from previous works with modality-specific representation spaces, and we present analyses that provide insight into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified, modality-agnostic VLP framework.
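To make the contrastive idea concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over image and caption embeddings. It is an illustration only, not OneR's actual training objective, which the abstract does not specify. The function names, the temperature value, and the toy embeddings are all assumptions introduced here; the embeddings stand in for outputs of a single shared encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(image_embs, text_embs, temperature=0.1):
    """Pairwise cosine similarities divided by a temperature (value is an assumption)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average of the image-to-text and text-to-image cross-entropy terms."""
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Hypothetical toy embeddings, as if one shared encoder produced both modalities.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # caption k matches image k
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # captions paired with the wrong images

loss_aligned = symmetric_contrastive_loss(image_embs, aligned_text)
loss_swapped = symmetric_contrastive_loss(image_embs, swapped_text)
print(loss_aligned < loss_swapped)
```

In this sketch, matched image-caption pairs sit on the diagonal of the similarity matrix, so the loss is low when each image is closest to its own caption and high when the pairing is swapped.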

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 26, 2023
Citations: 2

