Abstract

Among current reinforcement learning algorithms, the Policy Gradient (PG) algorithm [7] is one of the most traditional and widely used, but it suffers from unstable gradient estimation. The more recent Proximal Policy Optimization (PPO) algorithm [8] addresses this stability problem, but its policy updates are slow and it is prone to over-fitting when trained for too many iterations. In this article, a new method is proposed: drawing on Asynchronous Advantage Actor-Critic (A3C) [9], the basic PPO algorithm is trained in parallel, and a mechanism that accounts for future rewards is introduced, folding discounted future rewards into the current reward. Experiments on the OpenAI Gym platform with a robotic arm grasping objects at arbitrary positions show that our approach achieves faster training while also avoiding over-fitting during long-term training.
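One plausible reading of "folding future rewards into the current reward" is the standard discounted return, G_t = r_t + gamma * G_{t+1}. The sketch below illustrates that computation only; the function name, the discount factor value, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Fold future rewards back into each step: G_t = r_t + gamma * G_{t+1}.

    `gamma` is an assumed discount factor; the paper does not specify one here.
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

if __name__ == "__main__":
    # Hypothetical episode rewards, purely for illustration.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.99))
```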
