Abstract

In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic “zero-shot” scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel “imagination” module based on Regularized Auto-Encoders that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.

Highlights

  • Humans do not learn conceptual representations from language alone, but from a wide range of situational information (Beinborn et al., 2018; Bisk et al., 2020), as highlighted by property-listing experiments (McRae et al., 2005)

  • It is no wonder that recent trends in learning conceptual representations adopt multi-modal and holistic approaches (Bruni et al., 2014), in which abstract distributional lexical representations learned from text corpora (Landauer and Dumais, 1997; Laurence and Margolis, 1999) are augmented or refined with perceptual information from visual (Kiela et al., 2018; Lazaridou et al., 2015), olfactory (Kiela et al., 2015), or auditory (Kiela and Clark, 2015) modalities, yielding concrete and context-aware representations

  • We show that the new imagination models are state-of-the-art on the recently introduced CompGuessWhat?! benchmark (Suglia et al., 2020), outperforming current models by 8.26%

Introduction

Humans do not learn conceptual representations from language alone, but from a wide range of situational information (Beinborn et al., 2018; Bisk et al., 2020), as highlighted by property-listing experiments (McRae et al., 2005). When humans experience the concept of “boat”, they simulate a new representation by reactivating and aggregating multi-modal representations that reside in their memory and are associated with that concept (e.g., what a boat looks like, the action of sailing, etc.) (Barsalou, 2008). This process is called perceptual simulation. Visual guessing games provide a natural setting for studying such representations: GuessWhat?! (De Vries et al., 2017) is a prototypical language game of this kind, in which a Guesser has to identify a target object in a scene, represented as an image, by asking questions to an Oracle. Because the GDSE model does not deliver the multi-modality needed for this task, we extend it with our Imagination component to obtain more effective multi-modal object representations.
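As a rough illustration of the idea behind the imagination module (not the authors' architecture, which is detailed in the Methods section), the sketch below trains a linear auto-encoder on toy feature vectors, with an L2 penalty on the latent code standing in for the paper's regularizer. All dimensions, data, and names here are hypothetical toy values:

```python
import numpy as np

# Minimal sketch (NOT the paper's implementation) of a regularized
# auto-encoder: object features are encoded into a latent "imagined"
# embedding, decoded back, and trained with a reconstruction loss plus
# an L2 penalty on the latent code. All values below are toy assumptions.
rng = np.random.default_rng(0)

d_in, d_z = 16, 4                      # feature dim, latent dim (assumed)
W_enc = rng.normal(scale=0.1, size=(d_in, d_z))
W_dec = rng.normal(scale=0.1, size=(d_z, d_in))
lam, lr = 0.1, 0.05                    # regularizer weight, learning rate

X = rng.normal(size=(64, d_in))        # toy stand-in for object features

def loss_fn(X, W_enc, W_dec, lam):
    Z = X @ W_enc                      # encode
    X_hat = Z @ W_dec                  # decode
    rec = np.mean((X - X_hat) ** 2)    # reconstruction term
    reg = lam * np.mean(Z ** 2)        # latent-space regularizer
    return rec + reg

loss_before = loss_fn(X, W_enc, W_dec, lam)
for _ in range(200):
    Z = X @ W_enc
    err = Z @ W_dec - X                          # reconstruction error
    g_dec = Z.T @ err * (2 / err.size)           # dL/dW_dec
    g_z = err @ W_dec.T * (2 / err.size) + 2 * lam * Z / Z.size
    g_enc = X.T @ g_z                            # dL/dW_enc (chain rule)
    W_enc -= lr * g_enc
    W_dec -= lr * g_dec
loss_after = loss_fn(X, W_enc, W_dec, lam)
```

Deterministic regularized auto-encoders of this kind constrain the latent space with an explicit penalty rather than with the sampling noise of a variational auto-encoder, which keeps the latent embeddings smooth enough to decode from while remaining deterministic.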

