Abstract

Inferring Human Visuomotor Q-functions

Constantin Rothkopf (1)* and Dana Ballard (2)
(1) Frankfurt Institute for Advanced Studies, Germany
(2) University of Texas, United States

A large body of experimental data shows that reinforcement learning is a key component in the organization of animal behavior. Formal reinforcement learning (RL) models can in principle explain how humans learn to solve a task optimally from experience accumulated while interacting with the environment. But although RL algorithms can do this for small problems, their state spaces grow exponentially in the number of state variables, which has made it difficult to apply RL to realistic settings. One way to improve the situation would be to speed up learning by using a tutor. Early RL experiments showed that even small amounts of teaching could be highly effective, but the teaching signal took the form of correct decisions covering the agent's entire state space, an unrealistic assumption. In the more general problem an agent can observe only some of the actions of a teacher.

The problem of estimating reward functions from such subsets of expert behavior has been characterized as inverse reinforcement learning (IRL). One approach assumes that the agent has access to the teacher's base set of features on which to form a policy; under that assumption the problem reduces to finding the weighting of features that reproduces the teacher's policy [1]. The algorithm converges but is very expensive in terms of policy iterations. A subsequent Bayesian approach assumes a general form of the reward function and samples perturbations of the reward estimates in order to maximize the likelihood of the observed data [2]. However, this method is also very expensive, as it requires policy iteration in its innermost loop in order to converge.

We make dramatic improvements on [2] by using a specific parametric form of reward function, namely step functions with just a small number of transitions. With these functions policy iteration is not required, because the reward function's parameters can be computed directly. The method also extends to the modular formalism introduced in [3,4], which allows rewards to be estimated individually for subtasks and then used in combination. The algorithm is demonstrated on a humanoid avatar walking on a sidewalk while collecting litter and avoiding obstacles. Previously we had tuned the reward functions by hand to make the avatar perform the three tasks effectively. We show that reward functions recovered from human subjects performing the identical task are very close to those originally used to program the avatar. This demonstrates that it is possible to theorize about a human's RL algorithm by implementing that algorithm on a humanoid avatar and then test the theory by checking whether the reward structure implied by the human data is commensurate with that of the avatar.

Acknowledgments: Supported by NIH Grant R01RR009283
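To make the two ingredients concrete, the sketch below is a minimal illustration (ours, not the authors' implementation): a piecewise-constant "step function" reward with a single transition for each of the three subtasks, and a greedy action choice that combines per-module Q-values additively, in the spirit of the modular formalism of [3,4]. All function names, thresholds, and reward levels are hypothetical placeholders.

```python
import numpy as np

def step_reward(x, thresholds, levels):
    """Piecewise-constant ('step function') reward over a scalar state feature.

    thresholds -- increasing break points at which the reward changes level
    levels     -- reward on each interval; len(levels) == len(thresholds) + 1
    """
    return levels[int(np.searchsorted(thresholds, x))]

# Hypothetical step-function rewards for the three subtasks, each defined on the
# task-relevant distance (to the sidewalk centre, the nearest litter item, or the
# nearest obstacle). Thresholds and levels are illustrative only.
def sidewalk_reward(d): return step_reward(d, thresholds=[0.5], levels=[1.0, 0.0])
def litter_reward(d):   return step_reward(d, thresholds=[0.3], levels=[2.0, 0.0])
def obstacle_reward(d): return step_reward(d, thresholds=[0.4], levels=[-2.0, 0.0])

def combined_greedy_action(q_modules, module_states, actions):
    """Greedy action under the modular assumption Q(s, a) = sum_i Q_i(s_i, a),
    where each module sees only its own task-relevant state s_i."""
    totals = [sum(q(s, a) for q, s in zip(q_modules, module_states))
              for a in actions]
    return actions[int(np.argmax(totals))]

if __name__ == "__main__":
    # Toy one-step look-ahead: a stub "Q-function" scores an action by the reward
    # of the distance to litter it would produce (purely illustrative).
    actions = ["stay", "step_left", "step_right"]
    next_dist = {"stay": 0.6, "step_left": 0.2, "step_right": 0.9}
    q_litter = lambda s, a: litter_reward(next_dist[a])
    print(combined_greedy_action([q_litter], [None], actions))  # -> "step_left"
```

Because each module's reward has only a handful of free parameters (a threshold and the levels on either side of it), such a parameterization can be fit to observed behavior far more cheaply than an unconstrained reward, which is the property the abstract exploits to avoid policy iteration in the inner loop.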
