Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022, arXiv, DOI: 10.48550/arXiv.2203.02155 - This seminal paper from OpenAI describes the PPO-based Reinforcement Learning from Human Feedback (RLHF) pipeline that DPO aims to simplify.
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017, arXiv, DOI: 10.48550/arXiv.1707.06347 - Introduces Proximal Policy Optimization (PPO), a widely used policy-gradient reinforcement learning algorithm and a key component of the traditional RLHF pipeline. Its clipped surrogate objective is restated below for context.
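For quick reference, the core of PPO, and hence of the PPO-based RLHF pipeline cited above, is the clipped surrogate objective from the Schulman et al. paper, where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping parameter:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
$$

Clipping the probability ratio $r_t(\theta)$ keeps each policy update close to the previous policy, which is the stability mechanism DPO sidesteps by optimizing preferences directly without an explicit reward model or on-policy rollouts.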