Abstract
Data augmentation is a necessity for enhancing data efficiency in deep learning. For vision-language pre-training, previous works augment data only for images or only for text. In this paper, we present MixGen: a joint data augmentation method for vision-language representation learning that further improves data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. MixGen is simple and can be plugged into existing pipelines. We evaluate MixGen on four architectures, including CLIP, ViLT, ALBEF, and TCL, across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen to ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flickr30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+0.9% on NLVR2), visual question answering (+0.3% on VQA2.0), and visual entailment (+0.4% on SNLI-VE).
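The augmentation the abstract describes — interpolating two images while concatenating their captions — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `mixgen` and the interpolation weight `lam=0.5` are assumptions for the sketch.

```python
import numpy as np

def mixgen(image_a, image_b, text_a, text_b, lam=0.5):
    """Generate a new image-text pair from two existing pairs.

    The new image is a pixel-wise linear interpolation of the two
    input images; the new text is the concatenation of the two
    captions, which preserves the semantics of both pairs.
    lam=0.5 is an assumed default, not necessarily the paper's choice.
    """
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    mixed_text = text_a + " " + text_b
    return mixed_image, mixed_text

# Example: mix two (H, W, C) images and their captions.
img_a = np.zeros((224, 224, 3), dtype=np.float32)
img_b = np.ones((224, 224, 3), dtype=np.float32)
new_img, new_txt = mixgen(img_a, img_b, "a dog on grass", "a red car")
```

Because the output is just another image-text pair, such a step can be applied to a sampled fraction of each training batch before it enters any existing pre-training pipeline.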