Abstract

In this paper, we present the task of generating image descriptions with gold standard visual detections as input, rather than directly from an image. This allows the Natural Language Generation community to focus on the text generation process, rather than dealing with the noise and complications arising from the visual detection process. We propose a fine-grained evaluation metric specifically for evaluating the content selection capabilities of image description generation systems. To demonstrate the evaluation metric on the task, several baselines are presented using bounding box information and textual information as priors for content selection. The baselines are evaluated using the proposed metric, showing that the fine-grained metric is useful for evaluating the content selection phase of an image description generation system.

Highlights

  • There has been increased interest in the task of automatically generating full-sentence natural language image descriptions in recent years

  • For the task of classifying an image into one of 1,000 object categories on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC’14) dataset (Russakovsky et al., 2014), the state-of-the-art currently performs at a 4.82% top-5 error rate (Ioffe and Szegedy, 2015), comparable to the 5.1% error rate of a human annotator who trained themselves to recognise the object categories (Russakovsky et al., 2014)

  • We presented the task of generating image descriptions from gold standard labelled bounding boxes as input to a text generation system

Summary

Introduction

There has been increased interest in recent years in the task of automatically generating full-sentence natural language image descriptions. We assume that perfectly labelled object instances and their localisations are available to image description generation systems, as done in Elliott and Keller (2013) and Yatskar et al. (2014). Given this knowledge, we would like to evaluate how well image description generation systems perform through the various stages of Natural Language Generation (Reiter and Dale, 2000): content determination (what objects to describe), microplanning (how to describe objects) and realisation (generating the complete sentence). This pilot task is an attempt at encouraging fine-grained evaluation for image descriptions, compared to general-purpose metrics like METEOR (Denkowski and Lavie, 2014) that evaluate text at a global, coarse-grained level. We discuss further challenges with the proposed task, and introduce possible fine-grained metrics to be considered in the future.
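Evaluating the content determination stage amounts to comparing which objects a system chose to mention against the gold standard annotations. As a minimal illustrative sketch (not the paper's actual metric), content selection can be scored with set-based precision, recall and F1 over object labels; the function and label names below are assumptions for illustration only.

```python
# Illustrative sketch: set-based content-selection scoring.
# Assumes gold-standard object labels are available per image, as in the task
# described above; names and example labels here are hypothetical.

def content_selection_scores(predicted_objects, gold_objects):
    """Return (precision, recall, f1) over the sets of object labels selected."""
    pred, gold = set(predicted_objects), set(gold_objects)
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(pred & gold)
    precision = true_positives / len(pred)   # fraction of mentioned objects that are gold
    recall = true_positives / len(gold)      # fraction of gold objects that were mentioned
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: the description mentions "man" and "dog" correctly but misses "park".
p, r, f = content_selection_scores(["man", "dog", "frisbee"], ["man", "dog", "park"])
```

Scoring at the level of selected objects, rather than n-gram overlap as in METEOR, is what makes the evaluation fine-grained: a system is rewarded for describing the right objects even if its surface realisation differs from the reference.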

Motivation and Related Work
Task and Evaluation Measure
Fine-grained Evaluation Metric
Generating Descriptions
Generation based on Visual Cues
Generation based on Textual Priors
Function Words
Experimental Results
Combining Textual and Visual Priors
Discussion and Future
