Abstract

Recent research has demonstrated that human-generated reward signals can be used effectively to train agents to perform a range of reinforcement learning tasks. Such tasks are either episodic - i.e., conducted in unconnected episodes of activity that often end in either goal or failure states - or continuing - i.e., indefinitely ongoing. A further distinction is whether the learning agent heavily discounts the value of future reward - a myopic agent - or instead values future reward appreciably. In recent work, we found that all previous approaches to learning from human reward used myopic valuation [7]. That study also provided evidence for the desirability of myopic valuation in task domains that are both goal-based and episodic. In this paper, we conduct three user studies that examine critical assumptions of our previous research: task episodicity, optimal behavior with respect to a Markov Decision Process, and the lack of a failure state in the goal-based task. In the first experiment, we show that converting a simple episodic task to a non-episodic (i.e., continuing) task resolves some theoretical issues that arise in episodic tasks with generally positive reward and - relatedly - enables highly successful learning with non-myopic valuation in multiple user studies. The primary learning algorithm in this paper, which we call VI-TAMER, is the first algorithm to successfully learn non-myopically from human-generated reward; we also show empirically that such non-myopic valuation facilitates a higher-level understanding of the task. Anticipating the complexity of real-world problems, we perform two subsequent user studies - one with a failure state added - that compare (1) learning when states are updated asynchronously with local bias - i.e., states quickly reachable from the agent's current state are updated more often than other states - to (2) learning with the fully synchronous sweeps across each state of the VI-TAMER algorithm. With these locally biased updates, we find that the general positivity of human reward creates problems even for continuing tasks, revealing a distinct challenge for future research.
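As a rough illustration of the two update schemes compared in the later studies, the sketch below contrasts synchronous value-iteration sweeps (the VI-TAMER style) with locally biased asynchronous backups. Everything in it is an assumption for illustration: the toy dynamics P, the stand-in human_reward table, the discount factor, and the iteration counts are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np

# Illustrative sketch only: a tiny random MDP stands in for the paper's
# task, and `human_reward` stands in for a learned model of human reward;
# note it is mostly positive, the property the paper flags as problematic.
n_states, n_actions = 25, 4
gamma = 0.99  # non-myopic valuation; gamma near 0 would be myopic

rng = np.random.default_rng(0)
# P[s, a] gives the deterministic successor state (toy dynamics).
P = rng.integers(n_states, size=(n_states, n_actions))
human_reward = rng.uniform(0.1, 1.0, size=(n_states, n_actions))

def backup(V, s):
    """One Bellman backup of state s against the current value estimates."""
    return max(human_reward[s, a] + gamma * V[P[s, a]] for a in range(n_actions))

# (1) Synchronous sweeps, as in value iteration (the VI-TAMER style):
# every state is backed up once per sweep.
V_sync = np.zeros(n_states)
for _ in range(200):
    V_sync = np.array([backup(V_sync, s) for s in range(n_states)])

# (2) Locally biased asynchronous updates: states quickly reachable from
# the agent's current state are backed up far more often than others.
V_async = np.zeros(n_states)
s = 0  # agent's current state
for _ in range(200 * n_states):
    # Back up only the current state and its immediate successors.
    for t in {s} | {P[s, a] for a in range(n_actions)}:
        V_async[t] = backup(V_async, t)
    # The agent moves greedily, so updates stay concentrated near its path.
    s = P[s, int(np.argmax([human_reward[s, a] + gamma * V_async[P[s, a]]
                            for a in range(n_actions)]))]
```

Because the stand-in reward is mostly positive, the locally biased variant can keep inflating value estimates along the agent's own path while distant states stay stale; that interaction between reward positivity and update locality is, roughly, the failure mode the abstract points to.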

