Abstract

Image captioning has achieved significant progress through advances in feature extraction and model architecture. Recently, image region features extracted by object detectors have prevailed in most existing models. However, region features are criticized for lacking background and full contextual information. This problem can be remedied by providing complementary visual information from patch features. In this paper, we propose a Double-Stream Position Learning Transformer Network (DSPLTN) that exploits the advantages of both region features and patch features. Specifically, the region-stream encoder utilizes a Transformer encoder with a Relative Position Learning (RPL) module to enhance the representations of region features by modeling the relationships between regions and positions, respectively. For the patch-stream encoder, we introduce a convolutional neural network into the vanilla Transformer encoder and propose a novel Convolutional Position Learning (CPL) module to encode the position relationships between patches. CPL improves relationship modeling by combining the positions and visual content of patches. Incorporating CPL into the Transformer encoder synthesizes the benefits of convolution in local relation modeling and of self-attention in global feature fusion, thereby compensating for the information loss caused by flattening 2D feature maps into 1D patch sequences. Furthermore, an Adaptive Fusion Attention (AFA) mechanism is proposed to balance the contributions of the enhanced region and patch features. Extensive experiments on MSCOCO demonstrate the effectiveness of the double-stream encoder and CPL, and show the superior performance of DSPLTN.
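The abstract does not give equations for the Adaptive Fusion Attention (AFA) mechanism, but the idea of balancing the contributions of two feature streams is often realized as a learned gate. The following is a minimal illustrative sketch, not the authors' implementation: the function name `adaptive_fusion` and the gate parameters `W_g`, `b_g` are assumptions for demonstration, and a sigmoid gate over the concatenated streams is just one common choice.

```python
import numpy as np

def adaptive_fusion(region_feats, patch_feats, W_g, b_g):
    """Illustrative gated fusion of two feature streams (NOT the paper's
    exact AFA). A sigmoid gate computed from both streams weights each
    stream's per-dimension contribution, so the output is an elementwise
    convex combination of region and patch features."""
    # Concatenate the two streams along the feature dimension: (N, 2D)
    concat = np.concatenate([region_feats, patch_feats], axis=-1)
    # Sigmoid gate in [0, 1], shape (N, D)
    gate = 1.0 / (1.0 + np.exp(-(concat @ W_g + b_g)))
    # Blend: gate -> 1 favors region features, gate -> 0 favors patch features
    return gate * region_feats + (1.0 - gate) * patch_feats

# Example with random features: 5 tokens, feature dimension 8
rng = np.random.default_rng(0)
region = rng.normal(size=(5, 8))
patch = rng.normal(size=(5, 8))
W_g = rng.normal(size=(16, 8)) * 0.1  # gate projection (2D -> D)
b_g = np.zeros(8)
fused = adaptive_fusion(region, patch, W_g, b_g)
```

Because the gate lies in [0, 1], each fused element stays between the corresponding region and patch values, which is what makes this a "balancing" mechanism rather than an unconstrained mixture.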
