Abstract

Perception of sensory signals is strongly influenced by their context, in both space and time. In this paper, we propose a novel hierarchical model, called convolutional dynamic networks, that effectively utilizes this contextual information while inferring representations of the visual inputs. We build this model on a predictive coding framework and use the idea of empirical priors to incorporate recurrent and top-down connections. These connections endow the model with contextual information, both temporal context carried by the recurrent connections and abstract knowledge from higher layers. To perform inference efficiently in this hierarchical model, we rely on a novel scheme based on a smoothing proximal gradient method. When trained on unlabeled video sequences, the model learns a hierarchy of stable attractors representing low-level to high-level parts of objects. We demonstrate that the model effectively utilizes contextual information to produce robust and stable representations for object recognition in video sequences, even in the case of highly corrupted inputs.
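To make the inference idea concrete, below is a minimal sketch of a proximal-gradient (ISTA-style) state update for a single layer, where soft-thresholding serves as the proximal operator of an l1 sparsity penalty and the recurrent and top-down context enter as a quadratic empirical prior. The dictionary `D`, the equal mixing of the previous state and the top-down prediction, and all parameter values are illustrative assumptions, not the paper's formulation; the actual smoothing proximal gradient scheme over the full convolutional hierarchy is more involved.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def infer_states(x, D, u_prev, u_topdown, lam=0.1, gamma=0.5, n_iter=100):
    """ISTA-style inference for one layer (illustrative, not the paper's exact scheme):
        minimize_u  0.5*||x - D u||^2 + 0.5*gamma*||u - u_prior||^2 + lam*||u||_1,
    where u_prior mixes the previous state (recurrent, temporal context) and the
    top-down prediction from the layer above (empirical prior)."""
    u_prior = 0.5 * (u_prev + u_topdown)          # hypothetical equal mixing
    u = u_prior.copy()
    # Step size from a Lipschitz bound on the gradient of the smooth terms.
    L = np.linalg.norm(D, 2) ** 2 + gamma
    for _ in range(n_iter):
        grad = D.T @ (D @ u - x) + gamma * (u - u_prior)
        u = soft_threshold(u - grad / L, lam / L)
    return u

# Toy usage with random data and a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
x = rng.standard_normal(64)
u = infer_states(x, D, np.zeros(128), np.zeros(128))
```

In this sketch the priors only shift the quadratic penalty, so a corrupted input `x` pulls the state less strongly away from the contextual prediction, which is one simple way to obtain the robustness to corrupted inputs described above.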
