SAMT-generator: A second-attention for image captioning based on multi-stage transformer network

Xiaobao Yang, Yang Yang, Sugang Ma, Zhijie Li, Wei Dong, Marcin Woźniak

doi:10.1016/j.neucom.2024.127823

Abstract

In recent years, the Transformer has been widely used in tasks at the intersection of computer vision (CV) and natural language processing (NLP), such as image captioning, and prior Transformer-based captioning work has made remarkable progress. To further improve the model's ability to describe fine details and generate high-quality sentences, we first construct a Multi-stage Transformer Feature Enhancement Network (MT-FEN), which captures richer semantics by fusing features at different scales. We then propose a novel second-attention (SA) mechanism that focuses on the most valuable features and filters out noise in MT-FEN. In addition, to generate more sensible sentences, we use multiple decoding layers and select the best-describing word via a Max-out module. We call the overall model SAMT-Generator. Extensive experiments on two well-known datasets, MS COCO and Flickr8k, demonstrate the effectiveness of the proposed model and show that it is comparable with state-of-the-art (SOTA) methods.
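The abstract does not spell out how the Max-out module combines the decoding layers, so the following is only a minimal sketch under a stated assumption: each decoding layer emits one score per vocabulary word, the scores are pooled by an element-wise maximum across layers, and the highest-scoring word is emitted. The function name `maxout_word_choice`, the toy vocabulary, and all scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def maxout_word_choice(
    layer_logits: Sequence[Sequence[float]], vocab: Sequence[str]
) -> str:
    """Pick one word from the scores produced by several decoding layers.

    Assumption (not from the paper): each layer scores every vocabulary
    word, and Max-out keeps the element-wise maximum across layers.
    """
    if not layer_logits:
        raise ValueError("need at least one decoding layer")
    if any(len(row) != len(vocab) for row in layer_logits):
        raise ValueError("every layer must score the whole vocabulary")
    # Element-wise maximum across layers: one pooled score per word.
    pooled = [max(scores) for scores in zip(*layer_logits)]
    # Emit the word with the highest pooled score.
    best_index = max(range(len(vocab)), key=pooled.__getitem__)
    return vocab[best_index]


# Hypothetical toy data: three decoding layers scoring a four-word vocabulary.
vocab = ["a", "dog", "cat", "runs"]
layer_logits = [
    [0.1, 2.0, 0.3, 0.0],  # decoding layer 1
    [0.2, 1.5, 2.4, 0.1],  # decoding layer 2
    [0.0, 1.9, 0.5, 0.2],  # decoding layer 3
]
print(maxout_word_choice(layer_logits, vocab))
```

In this toy case layer 2 is the only one that strongly prefers "cat", yet its score survives the pooling step, which illustrates how taking a maximum lets any single layer's confident prediction determine the output word.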

Journal: Neurocomputing
Publication Date: May 10, 2024
Citations: 4

Similar Papers

SCT: Summary Caption Technique for Retrieving Relevant Images in Alignment with Multimodal Abstractive Summary
Shaik Rafi ... Ranjita Das
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 23
09 Mar 2024

Computer Vision and Natural Language Processing
Peratham Wiriyathammabhum ... Yiannis Aloimonos
ACM Computing Surveys | VOL. 49
12 Dec 2016

Enhanced gradient learning for deep neural networks
Ming Yan ... Yi Pan
IET Image Processing | VOL. 16
09 Nov 2021

Explorations into Deep Learning Text Architectures for Dense Image Captioning
Martina Toshevska ... Sonja Gievska
26 Sep 2020
