Abstract

In this article, we consider a subclass of partially observable Markov decision process (POMDP) problems which we term confounding POMDPs. In these POMDPs, temporal difference (TD)-based reinforcement learning (RL) algorithms struggle, as the TD error cannot be reliably derived from observations. We solve these problems using a new bio-inspired neural architecture that combines a modulated Hebbian network (MOHN) with a deep Q-network (DQN), which we call the modulated Hebbian plus Q-network architecture (MOHQA). The key idea is to use a Hebbian network with rarely correlated bio-inspired neural traces to bridge temporal delays between actions and rewards when confounding observations and sparse rewards result in inaccurate TD errors. In MOHQA, the DQN learns low-level features and control, while the MOHN contributes to high-level decisions by associating rewards with past states and actions. The proposed architecture thus combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, and exploits the advantages of both. Simulations on a set of POMDPs and on the Malmo environment show that the proposed algorithm improved DQN's results and even outperformed control tests with advantage actor-critic (A2C), quantile regression DQN with long short-term memory (QRDQN + LSTM), Monte Carlo policy gradient (REINFORCE), and aggregated memory for reinforcement learning (AMRL) algorithms on the most difficult POMDPs with confounding stimuli and sparse rewards.
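
The core MOHN mechanism can be summarized in a few lines. The sketch below is a minimal illustration, not the paper's actual implementation: it shows a reward-modulated Hebbian update with decaying eligibility traces and a sparse random gate standing in for rare correlations. All variable names and hyperparameters (e.g., `trace_decay`, `rare_corr_prob`) are illustrative assumptions.

```python
# Minimal sketch of a reward-modulated Hebbian update with eligibility traces
# and "rare correlations" (a sparse random gate on which pre/post pairs may
# record a trace). Names and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 32, 4          # feature size (e.g., from the DQN) and number of actions
W = np.zeros((n_out, n_in))  # associative (Hebbian) weights
E = np.zeros_like(W)         # eligibility traces, one per synapse

trace_decay = 0.95           # how slowly traces fade; bridges action-reward delays
rare_corr_prob = 0.01        # chance a synapse is allowed to record this co-activation
learning_rate = 0.1

def mohn_step(features, action_onehot, reward):
    """Accumulate sparse Hebbian traces, then let the reward (the modulatory
    signal) convert any outstanding traces into weight changes."""
    global W, E
    # Hebbian co-activation of post-synaptic (action) and pre-synaptic (feature) units
    hebb = np.outer(action_onehot, features)
    # Rare correlations: only a small random subset of synapses records this event
    gate = rng.random(W.shape) < rare_corr_prob
    E = trace_decay * E + gate * hebb
    if reward != 0.0:
        # Modulated update: a delayed, sparse reward consolidates the stored traces
        W += learning_rate * reward * E
        E = np.zeros_like(E)

# Toy usage: a reward arriving one step later still credits the earlier
# state-action pair through its eligibility trace.
mohn_step(rng.random(n_in), np.eye(n_out)[2], reward=0.0)
mohn_step(np.zeros(n_in), np.zeros(n_out), reward=1.0)
```

Because the traces decay slowly and only a small fraction of synapses is updated per step, the reward can assign credit to state-action pairs that occurred several steps earlier without relying on a bootstrapped TD target.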

Highlights

  • This section reports an analysis of how (i) the learning mechanisms in the modulated Hebbian network (MOHN) compare to those of REINFORCE and the deep Q-network (DQN) [30]; (ii) the MOHN and the new loss function enhance the features from the DQN to solve partially observable Markov decision process (POMDP) problems; and (iii) the modulated Hebbian plus Q-network architecture (MOHQA) compares against DQN, QRDQN + LSTM, REINFORCE, A2C, and aggregated memory for reinforcement learning (AMRL) on the CT-graph and Malmo benchmarks

  • The MOHN’s learning mechanisms, (i) Hebbian learning, (ii) eligibility traces, and (iii) rare correlations, are contrasted against two classical learning methods: (i) temporal difference (TD) learning in the form of DQN and (ii) policy gradient in the form of REINFORCE (see the TD-error sketch after this list)

  • This paper considers solving confounding POMDPs using a new neural architecture (MOHQA) for deep reinforcement learning
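
For contrast with the Hebbian mechanisms above, the following sketch spells out the standard one-step TD error that DQN-style learners bootstrap on. Under confounding (aliased) observations the bootstrapped target conflates distinct underlying states, and under sparse rewards the immediate reward term is almost always zero, which is the failure mode MOHQA is designed to address. The function and argument names are illustrative, not taken from the paper's code.

```python
# Illustrative one-step TD error as used by DQN-style value learning.
# q_net and target_net are placeholders for any callables mapping an
# observation to a vector of Q-values.
import numpy as np

def td_error(q_net, target_net, obs, action, reward, next_obs, done, gamma=0.99):
    """One-step TD error: r + gamma * max_a' Q_target(s', a') - Q(s, a)."""
    q_sa = q_net(obs)[action]
    bootstrap = 0.0 if done else gamma * float(np.max(target_net(next_obs)))
    return reward + bootstrap - q_sa

# Toy usage with a fixed tabular "network" over 3 observations and 2 actions
Q = np.array([[0.0, 0.2], [0.5, 0.1], [0.0, 0.0]])
print(td_error(lambda o: Q[o], lambda o: Q[o],
               obs=0, action=1, reward=0.0, next_obs=1, done=False))
```

When two different underlying states produce the same `obs`, both the `q_net(obs)` and `target_net(next_obs)` terms mix their values, so the resulting error signal no longer reflects the true state, unlike the trace-based credit assignment in the MOHN sketch above.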


Summary

Introduction

Findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force Research Laboratory (AFRL) or the Defense Advanced Research Projects Agency (DARPA). Nicholas Ketz and Praveen Pilly are with the Information and Systems Sciences Laboratory, HRL Laboratories, 3011 Malibu Canyon Road, Malibu, CA 90265, USA. Soheil Kolouri is with the Computer Science Department at Vanderbilt University, Nashville, TN 37235, USA. This research was performed when he was with the Information and Systems Sciences Laboratory, HRL Laboratories, Malibu, CA 90265.

