Delving into Precise Attention in Image Captioning

Shaohan Hu, Zheng Qin, Shenglei Huang, Guolong Wang, Zhipeng Li

doi:10.1007/978-3-030-36802-9_9

Abstract

Recent image captioning models usually directly use the output of the last convolutional layer of a pretrained CNN encoder. This intuitive design has two weaknesses: (1) the top-layer feature is not position-sensitive, which makes it harder for the decoder to generate precise spatial attention for the object of interest; (2) irrelevant features mislead the decoder into focusing on irrelevant regions. To tackle these weaknesses, we propose the Feature Selection and Fusion Network (FSFN). To address the first weakness, a Feature Fusion module generates fine-grained, position-sensitive features by fusing multi-scale features. To address the second, a Feature Selection module selects more informative features, which prevents the decoder from focusing on irrelevant regions. Extensive experiments demonstrate that our model addresses both weaknesses and achieves results comparable with the state of the art on the MSCOCO dataset under cross-entropy loss, without any bells and whistles. Furthermore, our model improves performance across different encoders and decoders.
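
The two ideas in the abstract, fusing multi-scale features and then selecting the informative ones, can be illustrated with a minimal, dependency-free sketch. The abstract does not say how FSFN implements either module, so everything here is an assumption made for illustration: nearest-neighbour upsampling with channel concatenation stands in for fusion, a fixed sigmoid gate over each channel's spatial mean stands in for selection (a real model would learn this gate), and the function names are hypothetical.

```python
import math

# A feature map is a list of channels; each channel is a 2D list (H x W) of floats.


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of every channel by an integer factor."""
    return [
        [[row[x // factor] for x in range(len(row) * factor)]
         for row in channel for _ in range(factor)]
        for channel in fmap
    ]


def fuse(fine, coarse):
    """Assumed fusion: bring the coarse map to the fine resolution,
    then concatenate the two maps along the channel axis."""
    factor = len(fine[0]) // len(coarse[0])
    return fine + upsample_nearest(coarse, factor)


def select(fmap):
    """Assumed selection: scale each channel by a sigmoid gate of its
    spatial mean. A trained model would learn the gate instead."""
    out = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        gate = 1.0 / (1.0 + math.exp(-sum(values) / len(values)))
        out.append([[gate * v for v in row] for row in channel])
    return out


# Toy input: one 2x2 fine-scale channel and one 1x1 coarse-scale channel.
fine = [[[1.0, 2.0], [3.0, 4.0]]]
coarse = [[[0.0]]]
features = select(fuse(fine, coarse))  # two gated channels at 2x2 resolution
```

The fused map keeps the spatial layout of the finer scale, which is the property the abstract calls position-sensitive, while the gate down-weights channels that carry little signal before they reach the decoder's attention.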
