MixGen: A New Multi-Modal Data Augmentation

Xiaoshuai Hao, Mu Li, Srikar Appalaraju, Bo Li, Aston Zhang, Wanqian Zhang, Yi Zhu

doi:10.1109/wacvw58289.2023.00042

Abstract

Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, previous works augment data only for images or only for text. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. It is simple and can be plugged into existing pipelines. We evaluate MixGen on four architectures (CLIP, ViLT, ALBEF and TCL) across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen in ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flickr30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+0.9% on NLVR²), visual question answering (+0.3% on VQA2.0), and visual entailment (+0.4% on SNLI-VE).
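The core operation described in the abstract, interpolating two images and concatenating their captions, can be illustrated with a minimal sketch. The abstract does not give the mixing weight or how pairs are chosen within a batch, so the function name `mixgen_pair`, the default `lam=0.5`, and the single-space separator between captions are assumptions made here for illustration only, not the authors' implementation.

```python
import numpy as np


def mixgen_pair(img_a, txt_a, img_b, txt_b, lam=0.5):
    """Build one augmented image-text pair from two existing pairs.

    Assumptions (not stated in the abstract): images are arrays of the
    same shape, `lam` is a fixed interpolation weight in [0, 1], and
    captions are joined with a single space.
    """
    if img_a.shape != img_b.shape:
        raise ValueError("images must share the same shape to be interpolated")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    # Linear interpolation of pixel values (image side of the augmentation).
    mixed_img = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)

    # Concatenation of the two captions (text side of the augmentation).
    mixed_txt = f"{txt_a} {txt_b}"
    return mixed_img, mixed_txt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dog = rng.random((224, 224, 3), dtype=np.float32)
    cat = rng.random((224, 224, 3), dtype=np.float32)
    image, caption = mixgen_pair(dog, "a dog on grass", cat, "a cat on a sofa")
    print(image.shape, caption)
```

Because both captions are kept in full, the mixed image still matches its text, which is one way to read the abstract's claim that semantic relationships are preserved.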
