Abstract

We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as tree fragments) from existing image descriptions, then to compose a new description by selectively combining the extracted (and optionally pruned) tree fragments. Key algorithmic components are tree composition and compression, both integrating tree structure with sequence structure. Our proposed system attains significantly better performance than previous approaches for both image caption generalization and generation. In addition, our work is the first to show the empirical benefit of automatically generalized captions for composing natural image descriptions.

Highlights

  • The web is increasingly visual, with hundreds of billions of user-contributed photographs hosted online

  • We tap into the last kind of text, using naturally occurring pairs of images with natural language descriptions to compose expressive descriptions for query images via tree composition and compression

  • We model image caption generalization as sentence compression, though in practical applications we may want the outputs of these two tasks to be different
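The compression view of caption generalization can be illustrated with a toy sketch. This is an assumption-laden simplification, not the paper's learned model: parse trees are nested `(label, children)` tuples, and the branch types and deletion probabilities below (`PP`, `ADJP`, and their scores) are made-up illustrative values, standing in for the learned branch deletion probabilities the paper estimates.

```python
# Toy sketch: caption generalization as sentence compression --
# pruning optional branches from a parse tree. Tree nodes are
# (label, children) tuples; leaves are plain word strings.
# DELETE_PROB values are illustrative, not learned.

DELETE_PROB = {"PP": 0.8, "ADJP": 0.6}  # assumed prunable branch types

def compress(node, threshold=0.5):
    """Drop branches whose deletion probability exceeds the threshold."""
    if isinstance(node, str):                    # leaf word
        return node
    label, children = node
    kept = []
    for child in children:
        child_label = child[0] if isinstance(child, tuple) else None
        if child_label in DELETE_PROB and DELETE_PROB[child_label] > threshold:
            continue                             # prune this branch
        kept.append(compress(child, threshold))
    return (label, kept)

def yield_words(node):
    """Read off the leaf words of a (possibly compressed) tree."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words += yield_words(child)
    return words

caption = ("S", [
    ("NP", ["the", "shaggy", "dog"]),
    ("VP", ["sleeps",
            ("PP", ["on", "a", "porch", "in", "Maine"])]),
])

print(" ".join(yield_words(compress(caption))))
# the location PP is pruned, leaving the generalized caption
```

Pruning the prepositional phrase yields "the shaggy dog sleeps" -- a more general caption that transfers to other images of sleeping dogs, which is the point of generalization.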


Summary

Introduction

The web is increasingly visual, with hundreds of billions of user-contributed photographs hosted online. The second direction, in a complementary avenue to the first, has explored ways to make use of the rich spectrum of visual descriptions contributed by online citizens (Kuznetsova et al., 2012; Feng and Lapata, 2013; Mason, 2013; Ordonez et al., 2011). In these approaches, the set of what can be described can be substantially larger than the set of what can be recognized, where the former is shaped and defined by the data, rather than by humans. The high-level idea of our system is to harvest useful bits of text (as tree fragments) from existing image descriptions using detected visual content similarity, and to compose a new description by selectively combining these extracted (and optionally pruned) tree fragments. While this overall idea of composition based on extracted phrases is not new in itself (Kuznetsova et al., 2012), we make several technical and empirical contributions. Our work results in an improved image caption corpus with automatic generalization, which is publicly available.
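The harvest-then-compose idea can be sketched in a few lines. This is a deliberately simplified stand-in for the paper's ILP-based composition, not its actual formulation: the fragments, the fixed slot sequence, and the similarity scores below are all invented for illustration, and the "best fragment per slot" selection replaces the joint optimization the paper actually performs.

```python
# Toy sketch (not the paper's ILP): harvested tree fragments are grouped
# by syntactic slot, each scored by visual similarity to the query image,
# and the best-scoring fragment per slot is stitched into a description.
# All fragments and scores here are made up for illustration.

harvested = {
    "NP": [("a brown dog", 0.91), ("a small puppy", 0.74)],
    "VP": [("runs across the grass", 0.88), ("sits quietly", 0.40)],
    "PP": [("near a wooden fence", 0.66), ("under a table", 0.21)],
}  # slot -> list of (fragment, visual-similarity) pairs

def compose(fragments, slots=("NP", "VP", "PP")):
    """Pick the highest-similarity fragment for each slot and join them."""
    chosen = [max(fragments[slot], key=lambda f: f[1])[0] for slot in slots]
    return " ".join(chosen)

print(compose(harvested))
# prints "a brown dog runs across the grass near a wooden fence"
```

The real system replaces this greedy per-slot choice with a joint integer linear program over fragment selection and ordering, so that the composed sentence is globally coherent rather than locally best per slot.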

Harvesting Tree Fragments
Tree Composition
ILP Variables
Discussion
Tree Compression
Dynamic Programming
Branch Deletion Probabilities
Experiments
Method
Human Evaluation
Related Work
Findings
Conclusion