Reinforcement learning algorithms often struggle to learn in partially observable environments, where different states of the environment may appear identical. However, not all partially observable environments pose the same level of difficulty for learning. This work introduces dissonance distance, a metric that estimates the difficulty of learning in such environments. We demonstrate that self-information, such as internal oscillations or memory of previous actions, can increase the dissonance distance and make learning easier in partially observable environments. Additionally, sensory occlusion may occur after learning has been completed, depriving the agent of sufficient information and causing catastrophic failure. To address this, we propose a brain-inspired spatially layered architecture (SLA) that trains multiple policies in parallel for the same task. SLA can change the amount of external information processed at each timestep, providing an adaptive approach to handling changing information in the environment's state space. We evaluate the effectiveness of the SLA method, showing learnability and robustness against realistic noise and occlusion of sensory inputs in the partially observable Continuous Mountain Car environment. We hypothesize that multi-policy approaches like SLA might explain complex dopamine dynamics in the brain that cannot be explained by the state-of-the-art scalar temporal-difference (TD) error.
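To make the multi-policy idea concrete, the following is a minimal, hypothetical sketch of an agent that maintains several policies in parallel, each consuming a different slice of the observation, and selects among them at each timestep depending on which sensory dimensions are currently available. The class names, layer slices, and selection rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class LinearPolicy:
    """Tiny linear policy over a fixed slice of the observation vector."""
    def __init__(self, obs_slice, n_actions=1, seed=0):
        self.obs_slice = obs_slice  # which observation dimensions this layer sees
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(len(obs_slice), n_actions))

    def act(self, obs):
        x = obs[self.obs_slice]
        return np.tanh(x @ self.w)  # bounded continuous action

class SpatiallyLayeredAgent:
    """Keeps several policies in parallel (one per sensory 'layer') and picks
    one per timestep based on which observation dimensions are usable."""
    def __init__(self, layer_slices):
        self.layers = [LinearPolicy(s, seed=i) for i, s in enumerate(layer_slices)]

    def act(self, obs, valid_mask):
        # Prefer the layer with the most inputs whose dimensions are all valid
        # (i.e., not occluded); otherwise fall back to the last, smallest layer.
        for policy in sorted(self.layers, key=lambda p: -len(p.obs_slice)):
            if valid_mask[policy.obs_slice].all():
                return policy.act(obs)
        return self.layers[-1].act(obs)

# Usage with a fake 3-dimensional observation, e.g. position, velocity, and an
# internal (self-information) oscillation signal.
agent = SpatiallyLayeredAgent([[0, 1, 2], [0, 2], [2]])
obs = np.array([0.4, -0.02, 0.7])
print(agent.act(obs, valid_mask=np.array([True, True, True])))   # full information
print(agent.act(obs, valid_mask=np.array([False, True, True])))  # position occluded
```

In this sketch, occlusion of a sensor does not break the agent: it simply switches to a policy trained on less external information, which mirrors the robustness claim made for SLA in the abstract.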