Abstract

Q-learning, along with variants such as the Deep Q-Network (DQN), trains reinforcement learning agents through an indirect approach. Generally, we calculate Q-values for state-action pairs and then select the action with the maximum value from the set of actions available in the environment. This indirect approach can be inefficient or even intractable for training agents with continuous action spaces; put another way, Q-learning-based methods are best suited to problems with discrete action spaces. Lillicrap et al. (2016) of Google DeepMind, in their paper entitled “Continuous Control with Deep Reinforcement Learning”, argued that “while DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces”. They stated further that “DQN cannot be straightforwardly applied to continuous domains since it relies on finding the action that maximizes the action-value function, which in the continuous-valued case requires an iterative optimization process at every step”. But what if we could bypass calculating Q-values and directly learn a policy from the environment? That is, instead of training our system to generate action values, we train it to output the probability of selecting each action. It turns out that this paradigm of training agents already exists: Zai and Brown (2020) state that “this class of algorithms is called policy gradient methods”.
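
To make the contrast concrete, the following is a minimal sketch (not from the paper) of the two action-selection schemes described above: a value-based agent that computes Q-values and takes the argmax, versus a policy-based agent that outputs a probability distribution over actions and samples from it directly. The arrays `q_values` and `preferences` are random placeholders standing in for the outputs of learned networks; all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4  # a small discrete action space for illustration

# --- Value-based (Q-learning / DQN) action selection ---
# The agent estimates a Q-value for every action in the current state,
# then acts indirectly by taking the argmax over those values.
q_values = rng.normal(size=n_actions)          # placeholder for Q(s, a)
greedy_action = int(np.argmax(q_values))       # indirect: pick the highest-valued action

# --- Policy-gradient action selection ---
# The agent instead produces a probability distribution over actions
# (here, a softmax over placeholder preferences) and samples from it directly.
preferences = rng.normal(size=n_actions)       # placeholder policy logits
policy = np.exp(preferences) / np.exp(preferences).sum()  # pi(a | s)
sampled_action = int(rng.choice(n_actions, p=policy))

print("value-based (greedy) action:", greedy_action)
print("policy-based (sampled) action:", sampled_action, "from probs", np.round(policy, 3))
```

The argmax step is what breaks down when the action space is continuous, since maximizing over infinitely many actions requires an iterative optimization at every step, whereas the policy-based agent simply parameterizes and samples from a distribution over actions.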
