A conceptual framework of computations in mid-level vision.

Jonas Kubilius, Johan Wagemans, Hans P. Op de Beeck

doi:10.3389/fncom.2014.00158


Open Access

https://doi.org/10.3389/fncom.2014.00158


Journal: Frontiers in Computational Neuroscience
Publication Date: Dec 12, 2014
Citations: 25
License type: CC BY

Affiliation: KU Leuven

Abstract

If a picture is worth a thousand words, as an English idiom goes, what should those words (or, rather, descriptors) capture? What format of image representation would be sufficiently rich if we were to reconstruct the essence of images from their descriptors? In this paper, we set out to develop a conceptual framework that would be: (i) biologically plausible, in order to provide a better mechanistic understanding of our visual system; (ii) sufficiently robust to apply in practice to realistic images; and (iii) able to tap into the underlying structure of our visual world. We bring forward three key ideas. First, we argue that surface-based representations are constructed through feature inference from the input in the intermediate processing layers of the visual system. Such representations are computed in a largely pre-semantic (prior to categorization) and pre-attentive manner using multiple cues (orientation, color, polarity, variation in orientation, and so on), and they explicitly retain configural relations between features. The constructed surfaces may partially overlap to compensate for occlusions and are ordered in depth (figure-ground organization). Second, we propose that such intermediate representations could be formed by a hierarchical computation of similarity between features in local image patches and pooling of highly similar units, and re-estimated via recurrent loops according to the task demands. Finally, we suggest using datasets composed of realistically rendered artificial objects and surfaces in order to better understand a model's behavior and its limitations.
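As a rough illustration of the second idea, the sketch below shows a single level of similarity computation and pooling on a toy grid of local orientation estimates. It is not the authors' model: the similarity measure, the 0.8 threshold, the 4-neighbour connectivity, and the names `orientation_similarity` and `pool_similar_units` are all assumptions made for the example, and the hierarchy across levels, the other cues, and the recurrent re-estimation are left out.

```python
def orientation_similarity(a, b):
    """Similarity in [0, 1] between two orientations in degrees (period 180)."""
    diff = abs(a - b) % 180.0
    diff = min(diff, 180.0 - diff)  # smallest angular difference, 0..90
    return 1.0 - diff / 90.0


def pool_similar_units(grid, threshold=0.8):
    """Pool neighbouring units whose similarity reaches the threshold.

    `grid` holds one orientation (degrees) per local patch. Returns a grid
    of integer group labels; each group is a candidate surface region.
    The threshold value is an arbitrary choice for this sketch.
    """
    rows, cols = len(grid), len(grid[0])
    parent = list(range(rows * cols))  # union-find forest over grid cells

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Compare each unit with its right and lower neighbour.
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    sim = orientation_similarity(grid[r][c], grid[nr][nc])
                    if sim >= threshold:
                        union(r * cols + c, nr * cols + nc)

    # Relabel group roots as consecutive integers in scan order.
    label_of = {}
    labelled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            root = find(r * cols + c)
            row.append(label_of.setdefault(root, len(label_of)))
        labelled.append(row)
    return labelled


# Toy input: near-horizontal patches on the left, near-vertical on the right.
grid = [
    [0, 5, 90, 85],
    [10, 0, 95, 90],
    [0, 5, 90, 88],
]
labels = pool_similar_units(grid)
for row in labels:
    print(row)
```

On this toy input the left and right halves end up in separate groups. A hierarchical version would repeat the same step on the pooled groups with a larger neighbourhood, which is one possible reading of the proposal rather than something the abstract specifies.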

Highlights

Efforts to understand how this is possible have led to the so-called standard view of the primate visual system, in which objects are rapidly extracted from images by a hierarchy of linear and non-linear processing stages. Simple and specific features are combined in a non-linear fashion, resulting in increasingly complex and transformation-tolerant features (Fukushima, 1980; Marr, 1982; Ullman and Basri, 1991; Riesenhuber and Poggio, 1999; DiCarlo and Cox, 2007; DiCarlo et al., 2012; see Kreiman, 2013, for a review).
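The alternation of linear and non-linear stages in this standard view can be sketched in one dimension: a linear template-matching stage followed by a non-linear max-pooling stage, loosely in the spirit of the hierarchical models cited above. The function names, the template, and the toy signals are hypothetical and serve only to show how pooling yields tolerance to position.

```python
def simple_responses(signal, template):
    """Linear stage: dot product of the template with every window of the signal."""
    k = len(template)
    return [
        sum(s * t for s, t in zip(signal[i:i + k], template))
        for i in range(len(signal) - k + 1)
    ]


def complex_response(responses):
    """Non-linear stage: max pooling over all positions."""
    return max(responses)


template = [1, -1, 1]                   # the feature being detected
image = [0, 0, 1, -1, 1, 0, 0, 0]       # feature at position 2
shifted = [0, 0, 0, 0, 1, -1, 1, 0]     # same feature at position 4

s1 = simple_responses(image, template)
s2 = simple_responses(shifted, template)

print(s1, complex_response(s1))
print(s2, complex_response(s2))
```

The position-specific responses `s1` and `s2` differ, while the pooled response is the same for both inputs. Stacking such stage pairs is what makes features progressively more complex and more tolerant to transformations.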

Summary


For this condition, von der Heydt et al. (1984) reported neurons in V2 that respond to these illusory contours nearly as vigorously as to luminance-defined ones. If these examples appear to be only curious cases of feature inference in artificial setups, imagine a typical cluttered image where multiple objects are partially occluded. The observation also holds for a more realistic image, in which we can agree that five objects are situated in different depth planes. We do not claim that recognition is irrelevant for segmentation, since recognition has been shown to bias figure-ground assignment (Peterson, 1994); our point is that segmentation can largely be done successfully without any knowledge about the identity of objects.
