Preference-based reinforcement learning (RL) trains agents using non-expert feedback without the need for detailed reward design. A human teacher guides the agent by comparing two behavior trajectories and labeling a preference. Although recent studies have improved feedback efficiency through methods such as unsupervised exploration and self- or semi-supervised learning, they often assume flawless human annotation. In practice, human teachers may make mistakes or hold conflicting opinions on trajectory preferences, which makes it difficult to capture user intent. To address this, we introduce mixing corrupted preferences (MCP) for robust and feedback-efficient preference-based RL. Inspired by the robustness of mixup against corrupted labels, MCP offers three key advantages. First, by mixing two labeled preferences component-wise, MCP mitigates the impact of corrupted feedback, enhancing robustness. Second, MCP improves feedback efficiency by generating unlimited new data even from limited labeled feedback. Third, MCP helps regulate overconfidence in the preference predictor, moderating excessive reward divergence between two trajectories. We evaluate our method on three locomotion and six robotic manipulation tasks from the B-Pref benchmark, with both perfectly rational and imperfect teachers (including actual human teachers). Our results show that MCP significantly outperforms PEBBLE while requiring fewer feedback instances and a shorter training period, highlighting its superior feedback efficiency. Our code is available at https://github.com/JongKook-Heo/MCP.
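As a rough illustration of the component-wise mixing idea, the sketch below mixes two labeled preference pairs (segments and their labels) and trains a reward model with a soft-label Bradley-Terry cross-entropy, as in PEBBLE-style preference learning. The reward model, tensor shapes, function names, and the Beta mixing coefficient are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch of component-wise preference mixup, assuming a PEBBLE-like
# setup: `reward_model` maps (batch, segment_len, obs_act_dim) features to
# per-step scalar rewards, and `labels` holds P(segment_1 preferred) in [0, 1],
# possibly corrupted. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


def bradley_terry_logits(reward_model, seg0, seg1):
    # Sum predicted per-step rewards over each segment to score the pair.
    r0 = reward_model(seg0).sum(dim=1)  # (batch, 1)
    r1 = reward_model(seg1).sum(dim=1)  # (batch, 1)
    return torch.cat([r0, r1], dim=-1)  # (batch, 2) logits


def mcp_loss(reward_model, seg0, seg1, labels, alpha=0.5):
    """Mix two labeled preference pairs component-wise, then compute the
    cross-entropy between the mixed labels and the Bradley-Terry prediction."""
    batch = seg0.shape[0]
    lam = torch.distributions.Beta(alpha, alpha).sample((batch, 1, 1))
    perm = torch.randperm(batch)

    # Component-wise mixing of both segments and their preference labels.
    seg0_mix = lam * seg0 + (1 - lam) * seg0[perm]
    seg1_mix = lam * seg1 + (1 - lam) * seg1[perm]
    lam_y = lam.view(batch, 1)
    labels_mix = lam_y * labels + (1 - lam_y) * labels[perm]

    logits = bradley_terry_logits(reward_model, seg0_mix, seg1_mix)
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = torch.cat([1 - labels_mix, labels_mix], dim=-1)  # soft labels
    return -(targets * log_probs).sum(dim=-1).mean()


if __name__ == "__main__":
    # Toy usage with a small reward network and random data.
    obs_act_dim, seg_len, batch = 16, 50, 32
    reward_model = nn.Sequential(
        nn.Linear(obs_act_dim, 64), nn.ReLU(), nn.Linear(64, 1)
    )
    seg0 = torch.randn(batch, seg_len, obs_act_dim)
    seg1 = torch.randn(batch, seg_len, obs_act_dim)
    labels = torch.randint(0, 2, (batch, 1)).float()  # possibly corrupted 0/1 feedback
    loss = mcp_loss(reward_model, seg0, seg1, labels)
    loss.backward()
    print(loss.item())
```

Because the mixing coefficient is resampled every update, a fixed pool of labeled pairs yields an effectively unlimited stream of interpolated training examples, and the soft mixed labels discourage the predictor from becoming overconfident on any single (possibly mislabeled) pair.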