Abstract

Value function approximation, such as Q-learning, is widely used in discrete control rather than continuous control because the optimal action is more easily selected in the discrete setting. In the continuous setting, optimizing the action is a non-convex optimization problem with respect to a complex value function. Some notable studies simplify this non-convex problem by assuming the value function is quadratic in the actions or by discretizing the action space; however, the performance of the output policy declines when these assumptions do not hold. To address this problem, we propose a framework that combines swarm intelligence algorithms with value-based Reinforcement Learning, where the swarm intelligence algorithms search for the optimal action with respect to the state and the value function. To ensure the correctness of this framework, we establish, under certain conditions, a high-probability convergence rate for swarm intelligence algorithms. We then implement the framework on the GPU, searching for the optimal actions of a batch of states in parallel to support batch training. Furthermore, we employ population-based atomic actions for compatibility with existing work on discrete control problems. Four classical control models and four robot simulation environments are used in the comparisons. According to the empirical results, in continuous control our framework outputs a policy comparable to that of policy-based algorithms using only 10% of the timesteps.

Note to Practitioners: This paper is motivated by the exploration-exploitation dilemma of Reinforcement Learning in continuous control tasks. To balance exploration and exploitation, stochastic exploration (e.g., ε-greedy) and prioritized exploration are roughly the two feasible approaches, and the prioritized one is the better choice due to its higher data efficiency. Normally, prioritized exploration works well in value-based Reinforcement Learning algorithms rather than policy-based ones; meanwhile, policy-based algorithms are more suitable for continuous control tasks than value-based ones. To resolve this conflict, we design a particle swarm optimization that maximizes the Q-value over actions in Q-learning. Our design can be hybridized with various swarm intelligence and value-based Reinforcement Learning algorithms, and it can be embedded in most intelligent control systems easily. The aim of this study is to solve continuous control tasks with value-based algorithms as a first step toward applying prioritized exploration. The simulation results verify the effectiveness and efficiency of our design.
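To make the core idea concrete, the following is a minimal sketch of the inner optimization step: a basic particle swarm optimizer that approximates argmax_a Q(s, a) over a box-constrained continuous action space. It is illustrative only; the function name pso_argmax_q, the hyperparameter values, and the single-state NumPy implementation are our assumptions for exposition, not the paper's exact GPU-batched design.

    import numpy as np

    def pso_argmax_q(q_fn, state, action_low, action_high,
                     n_particles=64, n_iters=25, w=0.7, c1=1.5, c2=1.5,
                     rng=None):
        """Approximate argmax_a Q(state, a) over a box-constrained action
        space with a basic particle swarm optimizer.

        q_fn(state, actions) must accept a batch of candidate actions of
        shape (n_particles, action_dim) and return their Q-values with
        shape (n_particles,).
        """
        rng = np.random.default_rng() if rng is None else rng
        action_low = np.asarray(action_low, dtype=float)
        action_high = np.asarray(action_high, dtype=float)
        dim = action_low.shape[0]

        # Initialize particle positions uniformly in the action box; zero velocity.
        pos = rng.uniform(action_low, action_high, size=(n_particles, dim))
        vel = np.zeros_like(pos)

        # Personal bests and the global best, scored by the Q-function.
        pbest_pos = pos.copy()
        pbest_val = q_fn(state, pos)
        g = np.argmax(pbest_val)
        gbest_pos, gbest_val = pbest_pos[g].copy(), pbest_val[g]

        for _ in range(n_iters):
            r1 = rng.random((n_particles, dim))
            r2 = rng.random((n_particles, dim))
            # Standard PSO velocity update: inertia + cognitive + social terms.
            vel = (w * vel
                   + c1 * r1 * (pbest_pos - pos)
                   + c2 * r2 * (gbest_pos - pos))
            pos = np.clip(pos + vel, action_low, action_high)

            # Re-score moved particles and update personal/global bests.
            val = q_fn(state, pos)
            improved = val > pbest_val
            pbest_pos[improved] = pos[improved]
            pbest_val[improved] = val[improved]
            g = np.argmax(pbest_val)
            if pbest_val[g] > gbest_val:
                gbest_pos, gbest_val = pbest_pos[g].copy(), pbest_val[g]

        return gbest_pos  # the (approximate) greedy continuous action

In a full agent, the action returned by such a search would serve both for greedy action selection and for the max operator in the Q-learning bootstrap target; the paper additionally performs this search for a whole batch of states in parallel on the GPU.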
