Abstract

Real-world applications of intelligent agents demand accuracy and efficiency, and seldom provide reinforcement signals. Currently, most agent models are reinforcement-based and concentrate exclusively on accuracy. We propose a general-purpose agent model consisting of proprioceptive and perceptual pathways. The agent actively samples its environment via a sequence of glimpses. It completes the partial propriocept and percept sequences observed up to each sampling instant, and learns where and what to sample by minimizing prediction error, without reinforcement or supervision (class labels). The model is evaluated by exposing it to two kinds of stimuli: images of fully-formed handwritten numerals and letters, and videos of the gradual formation of numerals. It yields state-of-the-art prediction accuracy upon sampling only 22.6% of the scene on average. The model saccades when exposed to images and tracks when exposed to videos. This is the first known attention-based agent to generate realistic handwriting with state-of-the-art accuracy and efficiency by interacting with and learning end-to-end from static and dynamic environments.

Highlights

  • Perception and action are inextricably tied together as, in the real world, efficiency is as important as accuracy

  • We propose a predictive agent model, which observes its visual environment via a sequence of glimpses

  • We have considered all recent attentional and non-attentional models that have reported prediction accuracy on the binarized version of MNIST dataset [31] in terms of negative log-likelihood (NLL)


Summary

INTRODUCTION

Perception and action are inextricably tied together as, in the real world, efficiency is as important as accuracy. Hard-attention models make decisions by processing a part of the data, sampled via a sequence of glimpses. These models are reinforcement-based (e.g., [2], [3]), unsupervised (e.g., [4], [5]) or supervised (e.g., [6]). Our function selects the location with maximum information gain at each glimpse. This model is supervised (uses class labels). This end-to-end model is efficient in terms of size and number of glimpses required for accurate prediction. It learns by sampling locations with maximum information gain at each glimpse. It yields state-of-the-art prediction accuracy upon sampling only 22.6% of the scene on average. It yields 4.9% lower error than the DRAW model on the binarized MNIST benchmark.
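The idea of sampling the location with maximum information gain, and of reporting the fraction of the scene sampled, can be illustrated with a toy sketch. This is not the paper's learned model: it assumes that the per-pixel Bernoulli entropy of a current belief map is an acceptable stand-in for expected information gain, that glimpses are non-overlapping square patches, and that a prior belief map (`prior`) is given. All function and parameter names here are hypothetical.

```python
import numpy as np

def bernoulli_entropy(p, eps=1e-12):
    """Per-pixel entropy of a Bernoulli belief; used here as a proxy for information gain."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def select_glimpse(belief, patch):
    """Return the top-left corner of the grid patch with the largest summed entropy."""
    h, w = belief.shape
    ent = bernoulli_entropy(belief)
    best_score, best_loc = -1.0, (0, 0)
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            score = ent[r:r + patch, c:c + patch].sum()
            if score > best_score:
                best_score, best_loc = score, (r, c)
    return best_loc

def sample_scene(image, prior, patch=4, n_glimpses=3):
    """Greedily observe the most uncertain patches of a binary image.

    Returns the visited locations and the fraction of pixels observed.
    """
    belief = prior.astype(float).copy()
    seen = np.zeros(image.shape, dtype=bool)
    locations = []
    for _ in range(n_glimpses):
        r, c = select_glimpse(belief, patch)
        # Observing a patch removes the uncertainty there (assumption: noiseless pixels).
        belief[r:r + patch, c:c + patch] = image[r:r + patch, c:c + patch]
        seen[r:r + patch, c:c + patch] = True
        locations.append((r, c))
    return locations, seen.mean()
```

Calling `sample_scene(image, prior, patch=4, n_glimpses=2)` on an 8x8 binary image returns the two visited patch corners and the fraction of pixels observed, which is the kind of quantity the 22.6% figure reports. In the paper the agent learns where to look by minimizing prediction error; the fixed prior and the greedy entropy rule above only stand in for that learned behaviour.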

MODELS AND METHODS
AGENT ARCHITECTURE
1: Recognition Model
CONCLUSIONS