Thangka Image—Text Matching Based on Adaptive Pooling Layer and Improved Transformer

Kaijie Wang, Jiao Wu, Xiaoran Guo, Kui Xu, Tiejun Wang

doi:10.3390/app14020807

Abstract

Image–text matching is an active research topic in multimodal learning, which integrates image and text processing. To address the difficulty of associating image and text data in the multimodal knowledge graph of Thangka, we propose an image–text matching method based on the Visual Semantic Embedding (VSE) model. The method introduces an adaptive pooling layer to strengthen the extraction of features that capture semantic associations between Thangka images and texts. We also improve the traditional Transformer architecture by combining bidirectional residual concatenation with masked attention mechanisms, which improves both the stability of the matching process and the ability to extract semantic information. In addition, we design a multi-granularity tag alignment module that maps global and local features of images and text into a common coding space, leveraging inter- and intra-modal semantic associations to improve image–text matching accuracy. Comparative experiments on the Thangka dataset show that our method achieves significant improvements over the VSE baseline: recall improves by 9.4% for image-to-text matching and by 10.5% for text-to-image matching. Furthermore, without any large-scale corpus pre-training, our method outperforms all models without pre-training and two of the four pre-trained models on the Flickr30k public dataset. The execution efficiency of our model is also an order of magnitude higher than that of the pre-trained models, which highlights its performance and efficiency in the image–text matching task.
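
The recall figures quoted in the abstract are retrieval metrics computed in both directions (image-to-text and text-to-image). The following is a minimal sketch of how such a Recall@K score is commonly computed from a joint embedding space. It is not the authors' evaluation code. The function names, the toy embeddings, the use of cosine similarity, and the assumption that image i is paired with exactly one text i are all illustrative; benchmark datasets such as Flickr30k pair each image with five captions, which requires a slightly different bookkeeping.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(image_embs, text_embs):
    """Return sim[i][j] = cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def recall_at_k(sim, k):
    """Fraction of queries (rows) whose paired item (same index) ranks in the top k.

    Assumes a one-to-one pairing: the correct match for row q is column q.
    """
    hits = 0
    for q, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(sim)


def transpose(matrix):
    """Swap rows and columns so that texts become the queries."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical 2-D embeddings standing in for the output of an image encoder
# and a text encoder that share a common space.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.8, 0.6], [0.1, 0.9]]

sim = cosine_similarity_matrix(image_embs, text_embs)
image_to_text_r1 = recall_at_k(sim, 1)             # images as queries
text_to_image_r1 = recall_at_k(transpose(sim), 1)  # texts as queries
print(image_to_text_r1, text_to_image_r1)
```

Evaluating both the similarity matrix and its transpose is what yields the two separate numbers reported for the two retrieval directions.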

Journal: Applied Sciences
Publication Date: Jan 17, 2024
License type: CC BY 4.0

Similar Papers

A hybrid approach for vision-based outdoor robot localization using global and local image features
Christian Weiss ... Hashem Tamimi
01 Oct 2007

3G structure for image caption generation
Aihong Yuan ... Xiaoqiang Lu
Neurocomputing | VOL. 330
01 Nov 2018

No-Reference Video Quality Assessment Using the Temporal Statistics of Global and Local Image Features
Domonkos Varga
Sensors (Basel, Switzerland) | VOL. 22
10 Dec 2022

TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip
Hao Zhang ... Xuewei Li
Information Technology and Control | VOL. 53
22 Mar 2024
