Abstract

Visual Question Answering (VQA) is a learning task that combines computer vision with natural language processing. In VQA, it is important to understand the alignment between visual concepts and linguistic semantics. In this paper, we propose a Pre-training Model Based on Parallel Cross-Modality Fusion Layer (P-PCFL) to learn the fine-grained relationship between vision and language. The P-PCFL model is composed of three encoders: an Object Encoder, a Language Encoder, and a Parallel Cross-Modality Fusion Encoder, with the Transformer as its core. We use four pre-training tasks, namely Cross-Modality Masked Language Modeling, Cross-Modality Masked Region Modeling, Image-Text Matching, and Image-Text Q&A, to pre-train the P-PCFL model and improve its reasoning ability and generality; these tasks help the model learn both intra-modality and inter-modality relationships. Experimental results on the VQA v2.0 Visual Question Answering dataset show that the pre-trained P-PCFL model performs well after fine-tuning. In addition, we conduct ablation experiments and provide attention visualizations to verify the effectiveness of the P-PCFL model.
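
For readers who want a concrete picture of this layout, the sketch below shows a minimal PyTorch version of the three-encoder design: a Language Encoder over question tokens, an Object Encoder over detected region features, and a stack of parallel cross-modality fusion layers in which each modality attends to the other. All module names, layer counts, and dimensions here are illustrative assumptions for exposition only, not the configuration or code reported in the paper.

```python
# Minimal sketch of the P-PCFL layout described above (assumed, not the authors' code).
import torch
import torch.nn as nn

class ParallelCrossModalityFusionLayer(nn.Module):
    """One fusion layer: each modality cross-attends to the other in parallel."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.vis_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, lang, vis):
        # Parallel cross-attention: language queries attend to visual keys/values
        # and vice versa, then each stream is refined by self-attention.
        lang_x, _ = self.lang_to_vis(lang, vis, vis)
        vis_x, _ = self.vis_to_lang(vis, lang, lang)
        return self.lang_self(lang + lang_x), self.vis_self(vis + vis_x)

class PPCFL(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, heads=12,
                 n_lang=9, n_obj=5, n_fusion=5, obj_feat_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.obj_proj = nn.Linear(obj_feat_dim, dim)   # region features -> model dim
        self.language_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), n_lang)
        self.object_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), n_obj)
        self.fusion = nn.ModuleList(
            [ParallelCrossModalityFusionLayer(dim, heads) for _ in range(n_fusion)])

    def forward(self, token_ids, region_feats):
        lang = self.language_encoder(self.word_emb(token_ids))
        vis = self.object_encoder(self.obj_proj(region_feats))
        for layer in self.fusion:
            lang, vis = layer(lang, vis)
        return lang, vis  # fused features for pre-training heads or a VQA answer head

# Usage with dummy inputs (batch of 2 questions, 36 detected regions per image):
model = PPCFL()
tokens = torch.randint(0, 30522, (2, 20))
regions = torch.randn(2, 36, 2048)
lang_out, vis_out = model(tokens, regions)
```

The pre-training heads for the four tasks (masked language modeling, masked region modeling, image-text matching, and image-text Q&A) would each take these fused representations as input; for downstream VQA, the same backbone is fine-tuned with an answer classifier.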

Highlights

  • With the continuous development of computer vision and natural language processing technology, researchers have gone deeper into the Visual Question Answering (VQA) research field

  • Experimental results on the VQA v2.0 Visual Question Answering dataset show that the pre-trained P-PCFL model performs well after fine-tuning


Summary

The data underlying the results presented in the study are available from the VQA website (https://visualqa.org/vqa_v2_teaser.html). Visual Question Answering (VQA) is a learning task that combines computer vision with natural language processing. We propose a Pre-training Model Based on Parallel Cross-Modality Fusion Layer (P-PCFL) to learn the fine-grained relationship between vision and language. The P-PCFL model is composed of three encoders: an Object Encoder, a Language Encoder, and a Parallel Cross-Modality Fusion Encoder, with the Transformer as its core. Experimental results on the VQA v2.0 Visual Question Answering dataset show that the pre-trained P-PCFL model performs well after fine-tuning. We conduct ablation experiments and provide attention visualizations to verify the effectiveness of the P-PCFL model.

Introduction
Main contributions of this paper
Model framework
Language Encoder
Object Encoder
Fine tuning
Experimental data set
Experimental settings and model parameters
Ablation experiment
Comparative experiment
Findings
Conclusion