Abstract

We combine a neural image captioner with a Rational Speech Acts (RSA) model to make a system that is pragmatically informative: its objective is to produce captions that are not merely true but also distinguish their inputs from similar images. Previous attempts to combine RSA with neural image captioning require an inference which normalizes over the entire set of possible utterances. This poses a serious problem of efficiency, previously solved by sampling a small subset of possible utterances. We instead solve this problem by implementing a version of RSA which operates at the level of characters (“a”, “b”, “c”, ...) during the unrolling of the caption. We find that the utterance-level effect of referential captions can be obtained with only character-level decisions. Finally, we introduce an automatic method for testing the performance of pragmatic speaker models, and show that our model outperforms a non-pragmatic baseline as well as a word-level RSA captioner.

Highlights

  • The success of automatic image captioning (Farhadi et al., 2010; Mitchell et al., 2012; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015) demonstrates compellingly that end-to-end statistical models can align visual information with language

  • We present a neural image captioning system that is a pragmatic speaker as defined by the Rational Speech Acts (RSA) model

  • Advantage of Incremental RSA: we observe that 66% of the time in which the S1 caption is referentially successful and the S0 …

Summary

Introduction

The success of automatic image captioning (Farhadi et al., 2010; Mitchell et al., 2012; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015) demonstrates compellingly that end-to-end statistical models can align visual information with language. The RSA speaker achieves this by reasoning both about what is true and about what it is like to be a listener in this context trying to identify the target. This core idea underlies much work in referring expression generation (Dale and Reiter, 1995; Monroe and Potts, 2015; Andreas and Klein, 2016; Monroe et al., 2017) and image captioning (Mao et al., 2016a; Vedantam et al., 2017), but these models do not fully confront the fact that the agents must reason about all possible utterances, which is intractable. We show that such character-level RSA speakers are more effective than literal captioning systems at the task of helping a reader identify the target image among close competitors, and that they outperform word-level RSA captioners in both efficiency and accuracy.
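The character-level idea can be illustrated with a minimal sketch: at each decoding step, take the literal speaker S0's next-character distribution conditioned on each candidate image, derive a literal listener L0 by normalizing over the candidates (assuming a uniform prior), and reweight S0 for the target image by how diagnostic each character is. This is only a schematic of one incremental RSA step, not the authors' implementation; the function name, the `alpha` rationality parameter, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def incremental_rsa_step(char_probs, target, alpha=1.0):
    """One character-level RSA step (illustrative sketch).

    char_probs : array of shape (n_images, vocab_size); row i is the literal
                 speaker S0's next-character distribution given image i.
    target     : index of the image the pragmatic speaker wants to pick out.
    alpha      : rationality parameter scaling the listener's influence.

    Returns S1's next-character distribution for the target image.
    """
    # Literal listener L0: normalize over candidate images per character
    # (uniform prior over candidates), giving P(image | character).
    l0 = char_probs / char_probs.sum(axis=0, keepdims=True)
    # Pragmatic speaker S1: reweight S0 by how strongly each character
    # points the listener at the target, then renormalize over characters.
    s1 = char_probs[target] * l0[target] ** alpha
    return s1 / s1.sum()

# Toy example: two candidate images, a three-character vocabulary.
# Character 1 is much likelier under the target than under the distractor.
probs = np.array([[0.5, 0.3, 0.2],   # S0 given the target image
                  [0.5, 0.1, 0.4]])  # S0 given the distractor
s1 = incremental_rsa_step(probs, target=0)
# The discriminative character is boosted relative to S0's 0.3,
# while the character favored by the distractor is suppressed below 0.2.
```

Because the normalization here runs over the character vocabulary rather than the full space of utterances, each step stays cheap, which is the efficiency point the paper makes against utterance-level RSA.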

Bayesian Pragmatics for Captioning
Applying Bayesian Pragmatics to a Neural Semantics
Step-Wise Inference
Evaluation
Results
Conclusion