Abstract

Learning from a small number of samples with reinforcement learning (RL) is challenging in many tasks, especially in real-world applications such as robotics. Meta-Reinforcement Learning (meta-RL) has been proposed as an approach to address this problem by generalizing to new tasks through experience from previous similar tasks. However, these approaches generally perform meta-optimization by applying direct policy search methods to validation samples from the adapted policies, and thus require large amounts of on-policy samples during meta-training. To this end, we propose a novel algorithm called Supervised Meta-Reinforcement Learning with Trajectory Optimization (SMRL-TO), which integrates Model-Agnostic Meta-Learning (MAML) with iLQR-based trajectory optimization. Our approach provides online supervision for the validation samples through iLQR-based trajectory optimization and embeds simple imitation learning into the meta-optimization in place of policy gradient steps. This constitutes a bi-level optimization that performs several gradient updates in each meta-iteration, consisting of off-policy reinforcement learning in the inner loop and online imitation learning in the outer loop. SMRL-TO achieves significant improvements in sample efficiency without human-provided demonstrations, owing to the effective supervision from iLQR-based trajectory optimization. In this paper, we describe how to use iLQR-based trajectory optimization to obtain labeled data and how to leverage these data to assist the training of the meta-learner. Through a series of robotic manipulation tasks, we further show that, compared with previous methods, the proposed approach can substantially improve sample efficiency and achieve better asymptotic performance.
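The abstract describes a bi-level meta-training loop: off-policy RL adaptation in the inner loop and online imitation of iLQR-optimized actions in the outer loop. The sketch below illustrates that structure only; it is not the authors' implementation. The helpers `sample_tasks`, `rl_loss`, and `ilqr_optimize`, the task interface (`replay_buffer`, `rollout`, `dynamics`), and all hyperparameters are hypothetical placeholders, and a first-order meta-gradient is assumed instead of differentiating through the inner updates.

```python
# Minimal sketch (assumed, not the paper's code) of an SMRL-TO-style bi-level loop:
# inner loop = off-policy RL adaptation, outer loop = imitation of iLQR supervision.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 16, 4          # placeholder dimensions
inner_lr, num_inner_steps = 1e-2, 3
num_meta_iters, meta_batch_size = 1000, 8

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
meta_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for meta_iter in range(num_meta_iters):
    meta_opt.zero_grad()
    for task in sample_tasks(meta_batch_size):           # hypothetical task sampler
        # ---- inner loop: off-policy RL adaptation from the task replay buffer ----
        adapted = copy.deepcopy(policy)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(num_inner_steps):
            batch = task.replay_buffer.sample()           # off-policy samples
            inner_opt.zero_grad()
            rl_loss(adapted, batch).backward()            # e.g. an actor-critic loss (placeholder)
            inner_opt.step()

        # ---- outer loop: online imitation of iLQR-optimized actions ----
        val_traj = task.rollout(adapted)                  # validation samples from the adapted policy
        expert_actions = ilqr_optimize(task.dynamics, val_traj)  # iLQR supervision (placeholder)
        obs = torch.as_tensor(val_traj.observations, dtype=torch.float32)
        target = torch.as_tensor(expert_actions, dtype=torch.float32)

        imitation_loss = ((adapted(obs) - target) ** 2).mean()   # behavior-cloning loss
        grads = torch.autograd.grad(imitation_loss, adapted.parameters())
        # First-order approximation: apply the adapted-policy gradients to the meta-parameters.
        for p, g in zip(policy.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```

The key design point reflected here is that the outer-loop objective is a supervised imitation loss against iLQR-generated action labels rather than a policy gradient on validation returns, which is what removes the need for large amounts of on-policy validation data.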
