Abstract

Image captioning is a challenging multimodal task. Significant improvements have been achieved with deep learning, yet captions generated by humans are still considered better, which makes image captioning an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim to improve the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on a joint embedding space formed by the union of high-level features and low-level features obtained from object-specific salient regions of the input image, where the content of the bounding boxes delivered by a pre-trained Mask R-CNN model is embedded. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized by re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state-of-the-art in image captioning.
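To make the re-ranking idea concrete, the following is a minimal sketch (not the paper's implementation) of how beam-search caption candidates could be re-scored with an explanatory feature such as object-label coverage. The function name `rerank_candidates`, the coverage score, and the weighting factor `alpha` are illustrative assumptions.

```python
# Hedged sketch: re-ranking beam-search caption candidates with an external
# score. The candidate format, the coverage score, and `alpha` are
# illustrative assumptions, not the paper's exact interface.

def rerank_candidates(candidates, object_labels, alpha=0.5):
    """candidates: list of (caption_tokens, log_prob) pairs from a beam
    search decoder; object_labels: set of labels detected in the image
    (e.g., by Mask R-CNN). Returns candidates sorted by a combined score."""
    def coverage(tokens):
        # Fraction of detected object labels mentioned in the caption:
        # a simple stand-in for an explanatory-feature score.
        if not object_labels:
            return 0.0
        return sum(lbl in tokens for lbl in object_labels) / len(object_labels)

    scored = [
        (alpha * log_prob / max(len(tokens), 1) + (1 - alpha) * coverage(tokens),
         tokens, log_prob)
        for tokens, log_prob in candidates
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [(tokens, log_prob) for _, tokens, log_prob in scored]


# Toy usage: two candidates with their cumulative log-probabilities.
candidates = [
    (["a", "man", "riding", "a", "horse"], -4.2),
    (["a", "person", "on", "an", "animal"], -3.9),
]
best_tokens, _ = rerank_candidates(candidates, object_labels={"man", "horse"})[0]
print(" ".join(best_tokens))
```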

Highlights

  • The goal of image captioning is to automatically generate descriptions for a given image, i.e., to capture the relationship between the objects present in the image, generate natural language expressions, and judge the quality of the generated descriptions

  • We show that effective re-ranking of caption candidates from a beam search decoder has great potential for improving results

  • We present a new architecture for image captioning that incorporates a top-down attention mechanism with bottom-up features of a scene: we encode the object-specific bounding boxes provided by the Mask R-CNN model [13] using the ResNet-101 architecture [14] (a sketch of this step follows the list)

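As a concrete illustration of the bottom-up feature extraction named in the last highlight, below is a hedged sketch that uses torchvision's pre-trained Mask R-CNN (with a ResNet-50 backbone, standing in for the detector of [13]) to propose regions and a ResNet-101 [14] to encode each cropped region. The function name `region_features`, the score threshold, the crop-and-resize strategy, and the cap of 36 regions are assumptions, not the paper's exact pipeline; a recent torchvision (>= 0.13) is assumed.

```python
# Hedged sketch: bottom-up region features from Mask R-CNN detections,
# each region encoded with ResNet-101. torchvision models are stand-ins
# for the pre-trained detectors referenced in the paper.
import torch
import torchvision
from torchvision.transforms.functional import normalize, resize

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
resnet101 = torchvision.models.resnet101(weights="DEFAULT").eval()
# Drop the classification head to obtain 2048-d pooled features per region.
encoder = torch.nn.Sequential(*list(resnet101.children())[:-1])

@torch.no_grad()
def region_features(image, score_thresh=0.7, max_regions=36):
    """image: float tensor (3, H, W) with values in [0, 1]. Returns a
    (k, 2048) tensor of features for up to `max_regions` confident detections."""
    detections = detector([image])[0]
    keep = detections["scores"] > score_thresh
    boxes = detections["boxes"][keep][:max_regions]
    feats = []
    for x1, y1, x2, y2 in boxes.round().long().tolist():
        crop = image[:, y1:y2, x1:x2]
        if crop.numel() == 0:
            continue
        crop = normalize(resize(crop, [224, 224]),
                         mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]).unsqueeze(0)
        feats.append(encoder(crop).flatten(1))  # (1, 2048) per region
    return torch.cat(feats, dim=0) if feats else torch.zeros(0, 2048)
```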

Summary

Introduction

The goal of image captioning is to automatically generate descriptions for a given image, i.e., to capture the relationship between the objects present in the image, generate natural language expressions (see an example in Fig. 1), and judge the quality of the generated descriptions. The problem is arguably more difficult than popular computer vision tasks such as object detection or segmentation, where the emphasis lies solely on identifying the different entities present in the image. We adopt and extend the architecture proposed in [49], as it is the most cited seminal work in the area of image captioning: it introduced the encoder-decoder architecture and the visual attention mechanism for image captioning in a simple yet powerful approach.
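The following sketch illustrates the kind of soft (additive) attention introduced in [49], here applied to a joint feature set formed by concatenating high-level grid features with per-region bottom-up features. Module names, dimensions, and the concatenation along the spatial axis are illustrative assumptions rather than the paper's exact architecture.

```python
# Hedged sketch: Show, Attend and Tell-style soft attention over the union
# of high-level grid features and low-level region features.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        """features: (batch, k, feat_dim) joint set of grid + region features;
        hidden: (batch, hidden_dim) decoder LSTM state. Returns the attended
        context vector (batch, feat_dim) and attention weights (batch, k)."""
        energy = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                          # (batch, k)
        alpha = torch.softmax(energy, dim=1)
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)
        return context, alpha

# Joint embedding space: concatenate grid features from the CNN encoder with
# the per-region features along the "spatial" axis before attending.
grid_feats = torch.randn(1, 196, 2048)     # e.g., a 14x14 ResNet feature map
region_feats = torch.randn(1, 36, 2048)    # e.g., encoded Mask R-CNN regions
joint = torch.cat([grid_feats, region_feats], dim=1)
context, alpha = SoftAttention()(joint, torch.randn(1, 512))
```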

