Abstract

Visual Question Answering (VQA) is a learning task that combines computer vision with natural language processing. In VQA, it is important to understand the alignment between visual concepts and linguistic semantics. In this paper, we propose a Pre-training Model Based on a Parallel Cross-Modality Fusion Layer (P-PCFL) to learn the fine-grained relationship between vision and language. The P-PCFL model is composed of three encoders: an Object Encoder, a Language Encoder, and a Parallel Cross-Modality Fusion Encoder, with the Transformer as their core. We use four pre-training tasks, namely Cross-Modality Masked Language Modeling, Cross-Modality Masked Region Modeling, Image-Text Matching, and Image-Text Question Answering, to pre-train the P-PCFL model and improve its reasoning ability and generality; these tasks help the model learn both intra-modality and inter-modality relationships. Experimental results on the Visual Question Answering dataset VQA v2.0 show that the pre-trained P-PCFL model performs well after fine-tuning. In addition, we conduct ablation experiments and provide attention visualizations to verify the effectiveness of the P-PCFL model.
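The encoder stack described above can be illustrated with a minimal sketch. Note that the hidden size, layer counts, vocabulary size, and region-feature dimension below are assumptions for illustration, not the paper's configuration, and the Parallel Cross-Modality Fusion Encoder is approximated here by a joint self-attention stack over the concatenated language and object sequences rather than the paper's exact fusion layer.

```python
# Hedged sketch of a P-PCFL-style encoder stack (assumed hyper-parameters).
import torch
import torch.nn as nn


class PPCFLSketch(nn.Module):
    def __init__(self, hidden=768, heads=12, vocab=30522,
                 n_lang=6, n_obj=4, n_fusion=4, region_dim=2048):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        # Language Encoder: token embeddings followed by Transformer layers.
        self.word_emb = nn.Embedding(vocab, hidden)
        self.lang_enc = nn.TransformerEncoder(make_layer(), n_lang)
        # Object Encoder: projected region features followed by Transformer layers.
        self.obj_proj = nn.Linear(region_dim, hidden)
        self.obj_enc = nn.TransformerEncoder(make_layer(), n_obj)
        # Fusion Encoder (approximation): joint attention over both modalities.
        self.fusion_enc = nn.TransformerEncoder(make_layer(), n_fusion)

    def forward(self, token_ids, region_feats):
        lang = self.lang_enc(self.word_emb(token_ids))      # (B, L_text, H)
        objs = self.obj_enc(self.obj_proj(region_feats))    # (B, L_obj, H)
        fused = self.fusion_enc(torch.cat([lang, objs], dim=1))
        return fused                                         # joint representation


# Usage example with dummy inputs (batch of 2, 16 tokens, 36 detected regions).
model = PPCFLSketch()
out = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(out.shape)  # torch.Size([2, 52, 768])
```

The fused output sequence would then feed the pre-training heads (masked language/region prediction, image-text matching, and question answering) during pre-training, and an answer classifier during VQA fine-tuning.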

Highlights


  • With the continuous development of computer vision and natural language processing technology, researchers are delving deeper into the Visual Question Answering (VQA) research field

  • Experimental results on the Visual Question Answering dataset VQA v2.0 show that the pre-trained P-PCFL model performs well after fine-tuning


Summary


Li X, Han D, Chang C-C (2022) Pre-training Model Based on Parallel Cross-Modality Fusion Layer. The data underlying the results presented in the study are available from https://visualqa.org/vqa_v2_teaser.html. Visual Question Answering (VQA) is a learning task that combines computer vision with natural language processing. We propose a Pre-training Model Based on a Parallel Cross-Modality Fusion Layer (P-PCFL) to learn the fine-grained relationship between vision and language. The P-PCFL model is composed of three encoders: an Object Encoder, a Language Encoder, and a Parallel Cross-Modality Fusion Encoder, with the Transformer as their core. Experimental results on the Visual Question Answering dataset VQA v2.0 show that the pre-trained P-PCFL model performs well after fine-tuning. We conduct ablation experiments and provide attention visualizations to verify the effectiveness of the P-PCFL model.

Introduction
Main contributions of this paper
Model framework
Language Encoder
Object Encoder
Fine tuning
Experimental data set
Experimental settings and model parameters
Ablation experiment
Comparative experiment
Findings
Conclusion

