Abstract

In this paper, we present the task of generating image descriptions with gold standard visual detections as input, rather than directly from an image. This allows the Natural Language Generation community to focus on the text generation process, rather than dealing with the noise and complications arising from the visual detection process. We propose a fine-grained evaluation metric specifically for evaluating the content selection capabilities of image description generation systems. To demonstrate the evaluation metric on the task, several baselines are presented using bounding box information and textual information as priors for content selection. The baselines are evaluated using the proposed metric, showing that the fine-grained metric is useful for evaluating the content selection phase of an image description generation system.

Highlights

  • There has been increased interest in the task of automatically generating full-sentence natural language image descriptions in recent years

  • For the task of classifying an image into one of 1,000 object categories on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC’14) dataset (Russakovsky et al., 2014), the state-of-the-art currently performs at a 4.82% top-5 error rate (Ioffe and Szegedy, 2015), comparable to the 5.1% error rate of a human annotator who trained themselves to recognise the object categories (Russakovsky et al., 2014)

  • We presented the task of generating image descriptions from gold standard labelled bounding boxes as input to a text generation system

Summary

Introduction

There has been increased interest in recent years in the task of automatically generating full-sentence natural language image descriptions. We assume that perfectly labelled object instances and their localisations are available to image description generation systems, as done in Elliott and Keller (2013) and Yatskar et al. (2014). Given this knowledge, we would like to evaluate how well image description generation systems perform through the various stages of Natural Language Generation (Reiter and Dale, 2000): content determination (what objects to describe), microplanning (how to describe objects) and realisation (generating the complete sentence). This pilot task is an attempt at encouraging fine-grained evaluation for image descriptions, compared to general-purpose metrics like METEOR (Denkowski and Lavie, 2014) that evaluate text at a global, coarse-grained level. We discuss further challenges with the proposed task, and introduce possible fine-grained metrics to be considered in the future.
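Evaluating the content determination stage amounts to comparing which objects a system chose to mention against the gold standard annotations. As a minimal illustrative sketch (not the paper's actual metric), content selection can be scored with set-based precision, recall and F1 over object labels; the function and label names below are assumptions for illustration only.

```python
# Illustrative sketch: set-based content-selection scoring.
# Assumes gold-standard object labels are available per image, as in the task
# described above; names and example labels here are hypothetical.

def content_selection_scores(predicted_objects, gold_objects):
    """Return (precision, recall, f1) over the sets of object labels selected."""
    pred, gold = set(predicted_objects), set(gold_objects)
    if not pred or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(pred & gold)
    precision = true_positives / len(pred)   # fraction of mentioned objects that are gold
    recall = true_positives / len(gold)      # fraction of gold objects that were mentioned
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: the description mentions "man" and "dog" correctly but misses "park".
p, r, f = content_selection_scores(["man", "dog", "frisbee"], ["man", "dog", "park"])
```

Scoring at the level of selected objects, rather than n-gram overlap as in METEOR, is what makes the evaluation fine-grained: a system is rewarded for describing the right objects even if its surface realisation differs from the reference.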

Motivation and Related Work
Task and Evaluation Measure
Fine-grained Evaluation Metric
Generating Descriptions
Generation based on Visual Cues
Generation based on Textual Priors
Function Words
Experimental Results
Combining Textual and Visual Priors
Discussion and Future
