Abstract

When referring to a target object in a visual scene, speakers are assumed to consider certain distractor objects to be more relevant than others. The current research predicts that the way in which speakers arrive at a set of relevant distractors depends on how they perceive the distance between the objects in the scene. It reports the results of two language production experiments in which participants referred to target objects in photo-realistic visual scenes. Experiment 1 manipulated three factors that were expected to affect perceived distractor distance: two manipulations of perceptual grouping (region of space and type similarity), and one of presentation mode (2D vs. 3D). In line with most previous research on visually grounded reference production, an offline measure of visual attention was taken here: the occurrence of overspecification with color. The results showed effects of region of space and type similarity on overspecification, suggesting that distractors perceived as being in the same group as the target are more often considered relevant distractors than distractors in a different group. Experiment 2 verified this suggestion with a direct measure of visual attention, eye tracking, and added a third manipulation of grouping: color similarity. For region of space in particular, the eye movement data indeed showed patterns in the expected direction: distractors within the same region as the target were fixated more often, and longer, than distractors in a different region. Color similarity was found to affect overspecification with color, but not gaze duration or the number of distractor fixations. The expected effects of presentation mode (2D vs. 3D) were also not convincingly borne out by the data. Taken together, these results provide direct evidence for the close link between scene perception and language production, and indicate that perceptual grouping principles can guide speakers in determining the distractor set during reference production.
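
To make this proposed mechanism concrete, the toy sketch below (in Python) illustrates how a grouping principle could restrict the distractor set before content selection. The object representation, the "region" attribute, and the function name are illustrative assumptions, not part of the experiments reported here.

    # Toy sketch (not from the paper): restricting the distractor set to
    # objects perceived as grouped with the target, operationalized here
    # as sharing the target's region of space.
    def relevant_distractors(target, scene_objects):
        """Return the distractors assumed to matter for reference production:
        all non-target objects in the same region as the target."""
        return [obj for obj in scene_objects
                if obj is not target and obj["region"] == target["region"]]

    scene = [
        {"name": "green bowl", "region": "left"},   # target
        {"name": "red bowl", "region": "left"},     # same group: relevant
        {"name": "plate", "region": "right"},       # different group: ignored
    ]
    print([d["name"] for d in relevant_distractors(scene[0], scene)])
    # -> ['red bowl']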

Highlights

  • Definite object descriptions are an important part of everyday communication, where speakers often produce them to identify objects in the physical world around them

  • The results show that presentation mode to some extent affected the use of the redundant attribute color, but this effect was only significant by items [F1(1, 46) = 2.73, n.s.; F2(1, 12) = 39.71, p < 0.001, ηp² = 0.77; minF′(1, 52) = 2.55, n.s.]

  • The two experiments described in this paper explored the impact of perceptual grouping and presentation mode (2D vs. 3D) on how speakers perceive distance between objects in a visual scene when referring to objects


Introduction

Definite object descriptions (such as “the green bowl” or “the large green bowl”) are an important part of everyday communication, where speakers often produce them to identify objects in the physical world around them. Solving the referential task here requires content selection: the speaker must decide on the attributes she includes to distinguish the bowl from any distractor object present in the scene (such as the other bowl, the plate, and the chairs). This notion of content selection not only reflects human referential behavior, but is also at the heart of computational models for referring expression generation. These models, most notably the Incremental Algorithm (Dale and Reiter, 1995), typically seek attributes with which a target object can be distinguished from its surrounding distractors, aiming to collect a set of attributes that rules out every distractor present in the scene (Van Deemter et al., 2012).
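
A minimal sketch of this selection loop is given below (in Python), assuming the common formulation in which objects are sets of attribute-value pairs and attributes are tried in a fixed preference order. The example objects and preference order are illustrative, and the sketch omits the original algorithm's special rule that the type attribute is always included as the head noun.

    # Minimal sketch of the Incremental Algorithm (Dale and Reiter, 1995).
    # Objects are attribute-value dictionaries; the preference order and
    # example scene below are illustrative assumptions.
    def incremental_algorithm(target, distractors, preferred_attributes):
        """Select attributes that distinguish target from distractors."""
        description = {}
        remaining = list(distractors)
        for attr in preferred_attributes:
            value = target.get(attr)
            # Distractors sharing the target's value are not ruled out;
            # include the attribute only if it rules out at least one.
            still_in = [d for d in remaining if d.get(attr) == value]
            if len(still_in) < len(remaining):
                description[attr] = value
                remaining = still_in
            if not remaining:  # every distractor ruled out
                break
        return description

    # Hypothetical scene: a green bowl among other tableware.
    target = {"type": "bowl", "color": "green", "size": "large"}
    distractors = [
        {"type": "bowl", "color": "red", "size": "large"},
        {"type": "plate", "color": "green", "size": "small"},
    ]
    print(incremental_algorithm(target, distractors, ["type", "color", "size"]))
    # -> {'type': 'bowl', 'color': 'green'}  ("the green bowl")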
