Abstract

Hierarchical reinforcement learning (HRL) learns decomposed subpolicies, each corresponding to a local region of the state space, which makes it a promising approach to complex robotic assembly control tasks that must be solved with few environment interactions. However, most existing HRL algorithms require on-policy learning, where new samples must be collected at every training step. In this article, we propose a data-efficient HRL algorithm based on off-policy learning, with three main contributions. First, two augmented Markov decision processes (MDPs) are reformulated so that the higher level policy and the lower level policies are learned from the same samples. Second, to learn a higher level policy that leads to efficient exploration, a softmax gating policy is derived to determine which lower level policy interacts with the environment. Third, to learn the lower level policies from off-policy samples stored in a single lower level replay buffer, the higher level policy derived from the option-value network selects the appropriate option, and the corresponding lower level policy is then trained. The data efficiency of our algorithm is validated in two simulations and on real-world robotic dual peg-in-hole assembly tasks.
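
The gating step described above can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names, the `temperature` parameter, and the example option values are hypothetical, and the greedy (argmax) selection used for the learning step is an assumption, since the abstract only states that the option is chosen by the higher level policy derived from the option-value network.

```python
import math
import random


def softmax_gating(option_values, temperature=1.0):
    """Turn option values into a probability distribution over options.

    `temperature` is a hypothetical knob: larger values give a flatter
    distribution and therefore more exploration.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    max_value = max(option_values)
    exps = [math.exp((q - max_value) / temperature) for q in option_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_option(option_values, rng, temperature=1.0):
    """Sample the option that interacts with the environment."""
    probs = softmax_gating(option_values, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_option(option_values):
    """Pick the highest-valued option (assumed rule for the learning step)."""
    return max(range(len(option_values)), key=lambda i: option_values[i])


# Hypothetical option values, standing in for an option-value network output.
q_values = [0.2, 1.5, -0.3]
rng = random.Random(0)
behaviour_option = sample_option(q_values, rng)  # drives exploration
learning_option = greedy_option(q_values)        # selects which subpolicy to update
```

Sampling from the softmax keeps every option reachable during data collection, while the separate selection rule decides which lower level policy a stored sample is used to train.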
