GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks

Onkar Susladkar, Gayatri Deshmukh, Dhruv Makwana, Rekha Singhal, Sparsh Mittal, R Sai Chandra Teja

doi:10.1109/wacv56688.2023.00521

Abstract

In "vision and language" problems, multimodal inputs are processed simultaneously to obtain a combined visual and textual understanding for image-text embedding. In this paper, we discuss the necessity of considering the difference between the feature space and the distribution when performing multimodal learning. We address this problem through deep learning and a generative model approach. We introduce a novel network, GAFNet (Global Attention Fourier Net), which is pre-trained at large scale on three image-text datasets (COCO, SBU, and CC-3M) to achieve high performance on downstream vision and language tasks. We propose a GAF (Global Attention Fourier) module, which integrates multiple modalities into one latent space. The GAF module is independent of the type of modality, and it allows shared representations to be combined at each stage. Various ways of thinking about the relationships between different modalities directly affect the model’s design. In contrast to previous research, our work considers visual grounding as a pretrainable and transferable quality instead of something that must be trained from scratch. We show that GAFNet is a versatile network that can be used for a wide range of downstream tasks. Experimental results demonstrate that our technique achieves state-of-the-art performance on multimodal classification on the CrisisMD dataset and on image generation on the COCO dataset. For image-text retrieval, our technique achieves competitive performance.
