This paper was accepted to the Human-in-the-Loop Learning Workshop at NeurIPS 2022.
Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of handcrafted reward functions by distilling a reward function from human preference feedback, but they remain impractical due to the large number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks. We hypothesize that REED-based methods better partition the state-action space and facilitate generalization to state-action pairs not included in the preference dataset. REED iterates between encoding environment dynamics in a state-action representation via a self-supervised temporal consistency task, and bootstrapping the preference-based reward function from the state-action representation. Whereas prior approaches train only on preference-labelled trajectory pairs, REED exposes the state-action representation to all transitions experienced during policy training. We explore the benefits of REED within the PrefPPO [1] and PEBBLE [2] preference learning frameworks and demonstrate improvements across experimental conditions in both the speed of policy learning and the final policy performance. For example, on quadruped-walk and walker-walk with 50 preference labels, REED-based reward functions recover 83% and 66% of ground truth reward policy performance, whereas without REED only 38% and 21% are recovered. For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward.
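To make the two-part objective concrete, the sketch below illustrates the general idea with a shared state-action encoder trained on (i) a self-supervised temporal-consistency loss over all observed transitions and (ii) a Bradley-Terry preference loss over labelled trajectory-segment pairs. This is a minimal illustration, not the authors' implementation: the class name, network sizes, projection heads, and loss details are assumptions made for the example.

```python
# Minimal sketch of a REED-style reward model (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F


class REEDRewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, z_dim=64):
        super().__init__()
        # Shared state-action representation phi(s, a).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )
        # Self-supervised head: predict an embedding of the next state from phi(s, a).
        self.dynamics_head = nn.Linear(z_dim, z_dim)
        self.next_state_proj = nn.Linear(obs_dim, z_dim)
        # Preference reward head bootstrapped from the shared representation.
        self.reward_head = nn.Linear(z_dim, 1)

    def phi(self, obs, act):
        return self.encoder(torch.cat([obs, act], dim=-1))

    def reward(self, obs, act):
        return self.reward_head(self.phi(obs, act)).squeeze(-1)


def temporal_consistency_loss(model, obs, act, next_obs):
    """Negative cosine similarity between predicted and actual next-state embeddings,
    computed on all transitions collected during policy training (not only labelled ones)."""
    pred = model.dynamics_head(model.phi(obs, act))
    target = model.next_state_proj(next_obs).detach()  # stop-gradient on the target branch
    return -F.cosine_similarity(pred, target, dim=-1).mean()


def preference_loss(model, seg1, seg2, labels):
    """Bradley-Terry cross-entropy over preference-labelled segment pairs.
    seg1/seg2 are (obs, act) tensors of shape [batch, T, dim];
    labels[i] = 0 if segment 1 is preferred, 1 if segment 2 is preferred."""
    r1 = model.reward(*seg1).sum(dim=-1)  # predicted return of each segment
    r2 = model.reward(*seg2).sum(dim=-1)
    logits = torch.stack([r1, r2], dim=-1)
    return F.cross_entropy(logits, labels.long())
```

In a PEBBLE- or PrefPPO-style loop, one plausible schedule is to minimize the temporal-consistency loss on replay-buffer transitions between preference updates and the preference loss on the labelled pairs, so that the reward head is learned on top of a representation shaped by the environment dynamics; the exact interleaving and loss weighting here are assumptions, not the paper's prescribed procedure.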