Abstract
Image captioning is a challenging multimodal task for which deep learning has brought significant improvements. Yet captions written by humans are still considered better, which makes image captioning an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim to improve the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on a joint embedding space formed by the union of high-level features and low-level features obtained from object-specific salient regions of the input image, embedding the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates from a beam search decoder using explanatory features, and we show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
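As a loose illustration of the re-ranking idea, the following Python sketch re-orders beam-search caption candidates by combining the decoder's length-normalized log-probability with an overlap score against the object labels detected by Mask R-CNN. The scoring heuristic, the `alpha` weight, and the function names are illustrative assumptions, not the procedure from the paper; in the interactive setting described above, a human would inspect the explanatory features and re-order the candidates instead.

```python
def rerank_candidates(candidates, detected_labels, alpha=0.5):
    """Re-rank beam-search captions (illustrative heuristic, not the paper's method).

    candidates:      list of (caption_tokens, log_prob) from the beam search
    detected_labels: set of object class names from the Mask R-CNN detector
    alpha:           illustrative weight trading off fluency vs. grounding
    """
    def grounding(tokens):
        # Fraction of detected objects that are mentioned in the caption.
        if not detected_labels:
            return 0.0
        return sum(label in tokens for label in detected_labels) / len(detected_labels)

    scored = [
        (alpha * log_prob / max(len(tokens), 1) + (1 - alpha) * grounding(tokens),
         tokens, log_prob)
        for tokens, log_prob in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(tokens, log_prob) for _, tokens, log_prob in scored]


# Toy usage: the grounded caption moves ahead of the higher-likelihood generic one.
beams = [(["a", "person", "on", "a", "field"], -3.9),
         (["a", "man", "riding", "a", "horse"], -4.2)]
print(rerank_candidates(beams, {"man", "horse"}))
```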
Highlights
The goal of image captioning is to automatically generate descriptions for a given image, i.e., to capture the relationship between the objects present in the image, generate natural language expressions, and judge the quality of the generated descriptions
We show that effective re-ranking of caption candidates from a beam search decoder has substantial potential for improving results
We present a new architecture for image captioning that combines a top-down attention mechanism with bottom-up features of a scene: we encode the object-specific bounding boxes provided by the Mask R-CNN model [13] using the ResNet-101 architecture [14], as sketched below
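A minimal sketch of that bottom-up feature extraction step, assuming torchvision's off-the-shelf models: the pre-trained Mask R-CNN (a ResNet-50-FPN backbone in torchvision, standing in for the detector of [13]) proposes boxes, and the content of each box is cropped, resized, and encoded with ResNet-101 [14]. The score threshold, input size, and the omitted ImageNet normalization are illustrative simplifications, not the exact configuration from the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet101
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pre-trained detector and encoder (newer torchvision prefers the `weights=` argument).
detector = maskrcnn_resnet50_fpn(pretrained=True).eval()
# ResNet-101 trunk without the average-pooling and classification layers.
encoder = torch.nn.Sequential(*list(resnet101(pretrained=True).children())[:-2]).eval()

@torch.no_grad()
def bottom_up_features(image, score_thresh=0.5):
    """Encode the content of each detected bounding box into a 2048-d vector."""
    # image: float tensor (3, H, W) with values in [0, 1]
    det = detector([image])[0]
    boxes = det["boxes"][det["scores"] > score_thresh].round().int().tolist()

    features = []
    for x1, y1, x2, y2 in boxes:
        crop = image[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
        crop = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                             mode="bilinear", align_corners=False)
        fmap = encoder(crop)                    # (1, 2048, 7, 7)
        features.append(fmap.mean(dim=(2, 3)))  # global average pool -> (1, 2048)
    return torch.cat(features) if features else torch.empty(0, 2048)
```

These per-box vectors can then be appended to the grid features of the whole image to form the joint embedding space over which attention is computed.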
Summary
The goal of image captioning is to automatically generate descriptions for a given image, i.e., to capture the relationships between the objects present in the image, generate natural language expressions (see an example in Fig. 1), and judge the quality of the generated descriptions. The problem is seemingly more difficult than popular computer vision tasks such as object detection or segmentation, where the emphasis is solely on identifying the different entities present in the image. We adopt and extend the architecture proposed in [49], since it is the most cited seminal work in the area of image captioning: it introduced the encoder-decoder architecture and the visual attention mechanism for image captioning in a simple yet powerful approach.
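For concreteness, the following sketch shows one soft-attention step in the spirit of the Show, Attend and Tell decoder [49]: the current hidden state scores each annotation vector (here, the joint grid-plus-box features) and the context vector is their weighted sum. Layer sizes and names are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """One soft-attention step: score each annotation vector with the decoder
    state, normalize with softmax, and return the weighted-sum context."""

    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, annotations, hidden):
        # annotations: (batch, num_regions, feat_dim) -- joint grid + box features
        # hidden:      (batch, hidden_dim)            -- decoder LSTM state
        e = self.score(torch.tanh(
            self.feat_proj(annotations) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                       # (batch, num_regions)
        alpha = torch.softmax(e, dim=1)      # attention weights, sum to 1
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=1)
        return context, alpha                # alpha doubles as an explanation map
```

The attention weights `alpha` over the object-specific regions are what make the model's decisions inspectable and support the explanatory features discussed above.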