Abstract

Most previous work on artificial curiosity (AC) and intrinsic motivation focuses on basic concepts and theory. Experimental results are generally limited to toy scenarios, such as navigation in a simulated maze, or control of a simple mechanical system with one or two degrees of freedom. To study AC in a more realistic setting, we embody a curious agent in the complex iCub humanoid robot. Our novel reinforcement learning (RL) framework consists of a state-of-the-art, low-level, reactive control layer, which controls the iCub while respecting constraints, and a high-level curious agent, which explores the iCub's state-action space by maximizing information gain, learning a world model from experience while controlling the actual iCub hardware in real time. To the best of our knowledge, this is the first embodied, curious agent for real-time motion planning on a humanoid. We demonstrate that it can learn compact Markov models to represent large regions of the iCub's configuration space, and that the iCub explores intelligently, showing interest in its physical constraints as well as in objects it finds in its environment.
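
The "information gain maximization" driving the high-level agent is commonly formalized in the artificial curiosity literature (the exact objective used in this paper may differ in detail) as an intrinsic reward equal to the information each observed transition provides about the learned world model, i.e. the KL divergence between the model posterior after and before the observation:

    \[
      r^{\text{int}}_t \;=\; D_{\mathrm{KL}}\!\left( p(\theta \mid h_t, s_t, a_t, s_{t+1}) \,\middle\|\, p(\theta \mid h_t) \right),
    \]

where θ parameterizes the transition model and h_t is the interaction history up to time t. The agent is thus rewarded for experiences that most reduce its uncertainty about the world, which is what makes poorly modelled regions (joint limits, contacts, objects) interesting to it.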

Highlights

  • Reinforcement Learning (RL) (Barto et al., 1983; Sutton and Barto, 1998; Kaelbling et al., 1996) allows an agent in an environment to learn a policy that maximizes cumulative reward.

  • The configuration of the robot must remain on the verified network of samples and trajectories at all times, or there may be unwanted collisions. This implies that all the trajectories in the graph must be controllable, which is in general difficult to verify in simulation for complex robots, such as the iCub, which exhibit non-linear dynamics and are very difficult to model faithfully.

  • The primary question addressed by this first experiment is: “To what extent does artificial curiosity (AC) help the agent learn the state transition probabilities for the Markov Decision Process (MDP) planner in this real-world setting?”
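
To make the third highlight concrete, the sketch below (a minimal illustration under assumed names such as DirichletTransitionModel and dirichlet_kl, not the paper's implementation) maintains Dirichlet counts over a tabular MDP's transitions, exposes the posterior-mean transition probabilities a planner would use, and scores every observed transition by its information gain, i.e. the KL divergence between the updated and previous Dirichlet posterior:

    import numpy as np
    from scipy.special import gammaln, digamma

    def dirichlet_kl(alpha, beta):
        # KL( Dir(alpha) || Dir(beta) ) between two Dirichlet distributions.
        a0, b0 = alpha.sum(), beta.sum()
        return (gammaln(a0) - gammaln(alpha).sum()
                - gammaln(b0) + gammaln(beta).sum()
                + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())

    class DirichletTransitionModel:
        def __init__(self, n_states, n_actions, prior=1.0):
            # counts[s, a, s'] start at a symmetric Dirichlet prior.
            self.counts = np.full((n_states, n_actions, n_states), prior)

        def transition_probs(self, s, a):
            # Posterior-mean estimate of P(s' | s, a) for the MDP planner.
            c = self.counts[s, a]
            return c / c.sum()

        def update(self, s, a, s_next):
            # Intrinsic reward = information gained by observing (s, a, s').
            alpha_old = self.counts[s, a].copy()
            alpha_new = alpha_old.copy()
            alpha_new[s_next] += 1.0
            info_gain = dirichlet_kl(alpha_new, alpha_old)
            self.counts[s, a] = alpha_new
            return info_gain

A curious agent would feed the returned information gain back as intrinsic reward, so actions that lead into poorly modelled regions of the configuration space stay attractive until the model has absorbed them.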


Introduction

Reinforcement Learning (RL) (Barto et al., 1983; Sutton and Barto, 1998; Kaelbling et al., 1996) allows an agent in an environment to learn a policy that maximizes cumulative reward. Rather than optimizing the policy directly, many RL algorithms instead learn a value function, defined as the expected future discounted cumulative reward. Much early RL research focused on discrete states and actions; continuous spaces are instead handled with function approximation and feature-based representations. Directed exploration methods focus the agent's efforts on novel regions of the state space. They include the classic and often effective heuristics of optimistic initialization, going to the least-visited state, and going to the least recently visited state.
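
For concreteness, here is a minimal tabular sketch of those directed-exploration heuristics (function names and the tabular framing are illustrative assumptions, not code from the paper):

    import numpy as np

    def optimistic_q(n_states, n_actions, r_max=1.0, gamma=0.95):
        # Optimistic initialization: start every Q-value at an upper bound on return,
        # so untried state-action pairs look attractive until experience lowers them.
        return np.full((n_states, n_actions), r_max / (1.0 - gamma))

    def least_visited_action(visit_counts, s):
        # "Go to the least-visited": pick the action tried least often in state s.
        return int(np.argmin(visit_counts[s]))

    def least_recently_visited_action(last_visit_step, s):
        # "Go to the least recently visited": pick the action whose last trial is oldest.
        return int(np.argmin(last_visit_step[s]))

Any of these selection rules can be dropped into a standard tabular Q-learning loop in place of epsilon-greedy exploration.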
