Abstract
What does it mean to produce a good description of an image? Is a description good because it correctly identifies all of the objects in the image, because it describes the interesting attributes of the objects, or because it is short, yet informative? Grice’s Cooperative Principle, stated as “Make your contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged” (Grice, 1975), alongside other ideas from pragmatics in communication, has proven useful in thinking about language generation (Hovy, 1987; McKeown et al., 1995). The Cooperative Principle provides one possible framework for thinking about the generation and evaluation of image descriptions.

The immediate question is whether automatic image description is within the scope of the Cooperative Principle. Consider the task of searching for images using natural language, where the purpose of the exchange is for the user to quickly and accurately find images that match their information needs. In this scenario, the user formulates a complete-sentence query to express those needs, e.g. A sheepdog chasing sheep in a field, and initiates an exchange with the system in the form of a sequence of one-shot conversations. In this exchange, both participants can describe images in natural language, and a successful outcome relies on each participant succinctly and correctly expressing their beliefs about the images. It follows that we can think of image description as facilitating communication between people and computers, and thus take advantage of the Principle’s maxims of Quantity, Quality, Relevance, and Manner in guiding the development and evaluation of automatic image description models.

An overview of the image description literature from the perspective of Grice’s maxims can be found in Table 1. The most apparent omission is the lack of research devoted to generating minimally informative descriptions: the maxim of Quantity. Attending to this maxim will become increasingly important as the quality and coverage of object, attribute, and scene detectors increase. It would be undesirable to develop models that describe every detected object in an image, because doing so would be likely to violate the maxim of Quantity (Spain and Perona, 2010). Similarly, if it becomes possible to associate an accurate attribute with each object in the image, it will be important to be sparing in the application of those attributes: is it relevant to describe “furry” sheep when there are no sheared sheep in an image?

How should image description models be evaluated with respect to the maxims of the Cooperative Principle? So far, model evaluation has focused on automatic text-based measures, such as Unigram BLEU, and on human judgements of semantic correctness (see Hodosh et al. (2013) for a discussion of framing image description as a ranking task, and Elliott and Keller (2014) for a correlation analysis of text-based measures against human judgements). The semantic correctness judgement task typically presents a variant of “Rate the relevance of the description for this image”, which only evaluates the description vis-à-vis the maxim of Relevance. One exception is the study of Mitchell et al. (2012), in which judgements about the ordering of noun phrases (the maxim of Manner) were also collected. The importance of being able to evaluate according to multiple maxims becomes clearer as computer vision becomes more accurate.
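To make the text-based measures above concrete, the following is a minimal sketch of sentence-level unigram BLEU (clipped unigram precision multiplied by a brevity penalty). The tokenisation, example sentences, and function name are illustrative assumptions, not the exact evaluation protocol of the studies cited above.

```python
from collections import Counter
import math

def unigram_bleu(candidate, references):
    """Sentence-level unigram BLEU: clipped unigram precision times brevity penalty."""
    cand_counts = Counter(candidate)
    # Clip each candidate unigram count by its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for tok, n in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], n)
    clipped = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / max(len(candidate), 1)

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * precision

# Example: a generated description scored against two human references.
candidate = "a sheepdog chasing sheep in a field".split()
references = ["a dog chases sheep across a grassy field".split(),
              "a sheepdog herds a flock of sheep".split()]
print(round(unigram_bleu(candidate, references), 3))  # 0.714
```

A measure of this kind rewards lexical overlap with the references, which says little about whether a description observes the maxims of Quantity or Manner.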
It seems intuitive that a model that describes and relates every object in the image could be characterised as generating Relevant and Quality descriptions, but not necessarily descriptions of an appropriate Quantity.
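The image-search scenario above also suggests a complementary, non-textual evaluation in the spirit of the ranking framing of Hodosh et al. (2013): score every query sentence against every image and report Recall@k for the gold pairings. The sketch below assumes a precomputed sentence-by-image score matrix; the toy scores and the function name recall_at_k are hypothetical.

```python
import numpy as np

def recall_at_k(scores, k):
    """Recall@k for sentence-to-image retrieval.

    scores[i, j] is the model's score for query sentence i against image j,
    and image i is assumed to be the gold match for sentence i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Rank of the gold image: its position when images are sorted by
        # descending score for sentence i.
        order = np.argsort(-row)
        rank = int(np.where(order == i)[0][0]) + 1
        hits += int(rank <= k)
    return hits / len(scores)

# Toy example: 3 query sentences scored against 3 images.
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.8, 0.4],
                   [0.1, 0.7, 0.6]])
print(recall_at_k(scores, 1))  # 2 of 3 gold images ranked first -> 0.666...
```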