This paper was accepted to the "Human in the Loop Learning" Workshop at NeurIPS 2022.
Specifying reward functions for reinforcement learning is a challenging task that preference-based learning methods sidestep by instead learning from preference labels over trajectory queries. These methods, however, still require an impractically large number of preference labels and often recover the underlying reward poorly. We present the PRIOR framework, which addresses both the impractical number of queries to humans and poor reward recovery by computing priors over the reward function from environment dynamics and a surrogate preference model. We find that imposing these priors as soft constraints significantly reduces the number of queries made to the human in the loop and improves overall reward recovery. Furthermore, we investigate computing these priors over an abstract state space to further improve agent performance.
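As a minimal sketch of how a reward prior might enter preference learning as a soft constraint (the abstract does not give the objective; the quadratic penalty, the weight $\lambda$, and the Bradley-Terry preference loss below are assumptions for illustration only), one plausible regularized objective is
\[
\mathcal{L}(\theta) \;=\; -\!\!\sum_{(\sigma^0,\,\sigma^1,\,y)} \Big[\, y \log P_\theta(\sigma^1 \succ \sigma^0) \;+\; (1-y)\log P_\theta(\sigma^0 \succ \sigma^1) \,\Big] \;+\; \lambda \sum_{s} \big( r_\theta(s) - \hat{r}_{\mathrm{prior}}(s) \big)^2,
\]
where the first term is the standard preference (Bradley-Terry) loss over labeled trajectory pairs $(\sigma^0, \sigma^1)$, $\hat{r}_{\mathrm{prior}}$ denotes a precomputed reward prior, and $\lambda$ controls how strongly the learned reward $r_\theta$ is pulled toward the prior rather than hard-constrained to it.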