Abstract

One of the most rapidly advancing areas of deep learning research aims at creating models that learn to disentangle the latent factors of variation from a data distribution. However, modeling joint probability mass functions is usually prohibitive, which motivates the use of conditional models assuming that some information is given as input. In the domain of numerical cognition, deep learning architectures have successfully demonstrated that approximate numerosity representations can emerge in multi-layer networks that build latent representations of a set of images with a varying number of items. However, existing models have focused on tasks that require conditionally estimating numerosity information from a given image. Here, we focus on a set of much more challenging tasks, which require conditionally generating synthetic images containing a given number of items. We show that attention-based architectures operating at the pixel level can learn to produce well-formed images approximately containing a specific number of items, even when the target numerosity was not present in the training distribution.

Highlights

  • In recent years, there has been a growing interest in the challenging problem of unsupervised representation learning [1]

  • These results are well-aligned with the existing empirical literature on human behavior, which suggests that numerosity estimates are distributed around the target mean and variability tends to increase with numerosity [42,44], and that numerosity estimation can be altered by confounding non-numerical magnitudes [21,25]

  • We investigated whether state-of-the-art deep learning architectures based on attention mechanisms could learn disentangled representations of numerosity from a set of images containing a variable number of items


Introduction

There has been a growing interest in the challenging problem of unsupervised representation learning [1]. The full potential of deep generative models was revealed by the introduction of variational autoencoders (VAE) [13] and generative adversarial networks (GAN) [14], which can discover and factorize extremely abstract attributes from the data [15,16]. These architectures can be further extended to promote the emergence of even more disentangled representations, such as in beta-VAE [17] and InfoGAN [18], or can exploit attention mechanisms to produce meaningful decompositions of complex visual scenes [19].
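To make the beta-VAE extension mentioned above concrete, the sketch below computes the two terms of the (negative) evidence lower bound for a diagonal-Gaussian posterior. This is a minimal illustration, not the architecture used in this paper: the function name, the squared-error reconstruction term, and the default value of `beta` are assumptions here; following [17], setting `beta > 1` up-weights the KL term to encourage more disentangled latent codes, while `beta = 1` recovers a standard VAE.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Negative ELBO with a beta-weighted KL term (beta-VAE sketch).

    x, x_recon : arrays of the same shape (input and its reconstruction)
    mu, log_var: parameters of the diagonal-Gaussian posterior q(z|x)
    beta       : weight on the KL term; beta=1 gives a plain VAE objective
    """
    # Reconstruction term: squared error summed over features
    recon = np.sum((x - x_recon) ** 2)
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, in closed form
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + beta * kl

# With a perfect reconstruction and a posterior equal to the standard
# normal prior, both terms vanish and the loss is zero.
x = np.ones(8)
loss = beta_vae_loss(x, x.copy(), mu=np.zeros(4), log_var=np.zeros(4))
```

In practice the reconstruction term would match the decoder's likelihood (e.g. Bernoulli cross-entropy for binary images), but the beta-weighted trade-off between reconstruction and KL is the same.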

Methods
Results
Conclusion