Inspired by the human vision system and learning, we propose a novel cognitive architecture that understands the content of raw videos in terms of objects without using labels. The architecture achieves four objectives: (1) Decomposing raw frames in objects by exploiting foveal vision and memory. (2) Describing the world by projecting objects on an internal canvas. (3) Extracting relevant objects from the canvas by analyzing the causal relation between objects and rewards. (4) Exploiting the information of relevant objects to facilitate the reinforcement learning (RL) process. In order to speed up learning, and better identify objects that produce rewards, the architecture implements learning by causality from the perspective of Wiener and Granger using object trajectories stored in working memory and the time series of external rewards. A novel non-parametric estimator of directed information using Renyi’s entropy is designed and tested. Experiments on three environments show that our architecture extracts most of relevant objects. It can be thought of as ‘understanding’ the world in an object-oriented way. As a consequence, our architecture outperforms state-of-the-art deep reinforcement learning in terms of training speed and transfer learning.