Abstract

Image and video captioning are important tasks in visual data analytics, as they concern the capability of describing visual content in natural language. They are the pillars of query-answering systems, improve indexing and search, and allow a natural form of human-machine interaction. Even though promising deep learning strategies are becoming popular, the heterogeneity of large image archives keeps this task far from solved. In this paper we explore how visual saliency prediction can support image captioning. Recently, some forms of unsupervised machine attention mechanisms have been spreading, but the role of human attention prediction has never been examined extensively for captioning. We propose a machine attention model driven by saliency prediction to generate image captions, which can be exploited in many cloud-based and multimedia services. Experimental evaluations are conducted on the SALICON dataset, which provides ground truth for both saliency and captioning, and on the large Microsoft COCO dataset, the most widely used benchmark for image captioning.

Highlights

  • Replicating the human ability to describe an image in natural language, providing a rich set of details at first glance, has been one of the primary goals of several research communities in recent years

  • We present a preliminary investigation into the role of saliency prediction in image captioning architectures

  • To investigate the role of visual saliency in attentive captioning models, we extend this schema by splitting the machine attention into salient and non-salient regions, and learning different weights for each (see the sketch after this list)
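The split described above can be made concrete as an attention layer that scores each spatial region with two learned heads and blends the two scores according to the predicted saliency of that region. The following is a minimal PyTorch sketch, not the paper's implementation: the module name, layer names, and the per-region blending scheme are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SaliencySplitAttention(nn.Module):
        """Additive attention over spatial features, split into salient and
        non-salient regions with separately learned weights (hypothetical sketch)."""

        def __init__(self, feat_dim, hidden_dim, attn_dim):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, attn_dim)     # project region features
            self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
            # Two scoring heads: one for salient regions, one for non-salient ones
            self.score_sal = nn.Linear(attn_dim, 1)
            self.score_non = nn.Linear(attn_dim, 1)

        def forward(self, feats, state, saliency):
            # feats:    (B, R, feat_dim)  feature vectors for R spatial regions
            # state:    (B, hidden_dim)   current hidden state of the language decoder
            # saliency: (B, R)            predicted saliency per region, in [0, 1]
            e = torch.tanh(self.feat_proj(feats) + self.state_proj(state).unsqueeze(1))
            # Score each region with both heads, then blend by its saliency value
            s = saliency.unsqueeze(-1)
            scores = s * self.score_sal(e) + (1.0 - s) * self.score_non(e)
            alpha = F.softmax(scores.squeeze(-1), dim=1)        # attention weights
            context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted context vector
            return context, alpha

In this sketch, a region whose predicted saliency is close to 1 is scored mainly by the salient head, so the model can learn to weight human-attended regions differently from the background when building the context vector fed to the language decoder.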


Summary

Introduction

Replicating the human ability to describe an image in natural language, providing a rich set of details at first glance, has been one of the primary goals of several research communities in recent years. Captioning models should not only identify each and every object in the scene, but also express their names and relationships in natural language. The enormous variety of visual data makes this task challenging: it is very hard to predict a priori, driven only by data, what is interesting in an image and what should be described. Describing visual data in natural language opens the door to many future applications: the one with the largest potential impact is that of defining new services for search and retrieval in visual data archives, using query-answering tools working on natural language, as well as improving the performance of more
