Stabilizing deep Q-learning with Q-graph-based bounds

Sabrina Hoppe, Robert Krug, Marc Toussaint, Markus Giftthaler

doi:10.1177/02783649231185165

Abstract

State-of-the-art deep reinforcement learning has enabled autonomous agents to learn complex strategies from scratch on many problems, including continuous control tasks. Deep Q-networks (DQN) and deep deterministic policy gradients (DDPG) are two such algorithms, both based on Q-learning. They therefore share function approximation, off-policy behavior, and bootstrapping, the constituents of the so-called deadly triad, which is known for its convergence issues. We propose taking a graph perspective on the data an agent has collected and show that the structure of this data graph is linked to the degree of divergence that can be expected. We further demonstrate that a subset of states and actions from the data graph can be selected such that the resulting finite graph can be interpreted as a simplified Markov decision process (MDP) for which the Q-values can be computed analytically. These Q-values are lower bounds for the Q-values in the original problem, and enforcing these bounds in temporal difference learning can help to prevent soft divergence. We show further effects on a simulated continuous control task, including improved sample efficiency, increased robustness toward hyperparameters, and a better ability to cope with limited replay memory. Finally, we demonstrate the benefits of our method on a large robotic benchmark with an industrial assembly task and approximately 60 h of real-world interaction.
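The idea of computing Q-values on a finite graph and using them as lower bounds in temporal difference learning can be illustrated with a small sketch. This is not the authors' implementation: the function names `graph_q_values` and `bounded_td_target`, the tuple layout of the transitions, and the use of iterative value sweeps instead of a closed-form solution are all assumptions made for illustration. The sketch further assumes deterministic transitions and non-negative rewards, so that a value of zero is a valid lower bound for states with no recorded outgoing action.

```python
def graph_q_values(transitions, gamma=0.99, sweeps=1000, tol=1e-9):
    """Q-values of the finite MDP spanned by recorded transitions.

    transitions: iterable of (state, action, reward, next_state, done).
    Assumes deterministic dynamics and non-negative rewards (illustrative).
    """
    # Each recorded (state, action) pair is one edge of the data graph.
    edges = {(s, a): (r, s2, done) for s, a, r, s2, done in transitions}

    # Actions known to be available at each state in the graph.
    actions_at = {}
    for s, a in edges:
        actions_at.setdefault(s, []).append(a)

    # Start at zero; sweep the Bellman backup until values stop changing.
    q = {sa: 0.0 for sa in edges}
    for _ in range(sweeps):
        delta = 0.0
        for sa, (r, s2, done) in edges.items():
            known = actions_at.get(s2, [])
            if done or not known:
                # Terminal or unexplored successor: no bootstrapped value.
                target = r
            else:
                target = r + gamma * max(q[(s2, a2)] for a2 in known)
            delta = max(delta, abs(target - q[sa]))
            q[sa] = target
        if delta < tol:
            break
    return q


def bounded_td_target(reward, next_q_max, done, lower_bound, gamma=0.99):
    """One-step TD target, clipped from below by a graph-based bound."""
    target = reward if done else reward + gamma * next_q_max
    return max(target, lower_bound)


# Tiny hypothetical data graph: s0 -> s1 -> terminal, plus a self-loop at s0.
transitions = [
    ("s0", "right", 0.0, "s1", False),
    ("s1", "right", 1.0, "s2", True),
    ("s0", "stay", 0.0, "s0", False),
]
bounds = graph_q_values(transitions, gamma=0.9)
```

Because the graph only contains a subset of the actions actually available, the maximization in each backup ranges over fewer choices than in the original problem, which is why the resulting values can only underestimate the true Q-values under the stated assumptions. In a DQN- or DDPG-style update, `bounded_td_target` would stand in for the usual target whenever a bound is known for the sampled state-action pair.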

Journal: The International Journal of Robotics Research

