Reinforcement learning from human feedback (RLHF) is an effective approach to aligning language models with human preferences. Fundamental to RLHF is learning a reward function that scores responses according to human preferences. Two main approaches to learning such a reward are 1) training an explicit reward model, as in standard RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Previous work has shown that the DPO implicit reward can approximate a trained reward model, but it is unclear to what extent DPO's implicit reward generalizes under distribution shift, which may arise from limited preference data or from changes in the language model being trained. We address this question by comparing the accuracy of DPO implicit rewards and RLHF reward models in distinguishing preferred from rejected responses. Our findings indicate that the DPO implicit reward performs similarly to RLHF reward models on in-distribution data but significantly underperforms them out of distribution. Across five out-of-domain configurations, DPO shows a mean accuracy drop of 3% and a maximum drop of 7%, highlighting the shortcomings of DPO's implicit reward model for preference optimization. These findings indicate that the implicit reward model of DPO has limited generalizability and motivate the integration of an explicit reward model into iterative DPO approaches.
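For reference, the implicit reward discussed above is the standard form induced by the DPO objective (following Rafailov et al.'s formulation; the symbols $\beta$, $\pi_\theta$, and $\pi_{\mathrm{ref}}$ are the usual DPO notation, not notation defined in this abstract):
$$
r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
$$
where $\pi_\theta$ is the DPO-trained policy, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ is the KL-regularization coefficient. Comparing this quantity for a preferred and a rejected response yields the preference accuracy measured in this work.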