Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, 2017 (arXiv preprint arXiv:1707.06347) DOI: 10.48550/arXiv.1707.06347 - The original paper introducing the PPO algorithm, detailing its objective clipping and KL penalty mechanisms for stable and efficient policy optimization.
Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe, 2022 (Advances in Neural Information Processing Systems 35, NeurIPS 2022) DOI: 10.48550/arXiv.2203.02155 - A landmark paper that demonstrates fine-tuning large language models with PPO against human preferences, forming the basis of RLHF in LLMs.
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018 (MIT Press) - A standard textbook on reinforcement learning, providing foundational understanding of policy gradient methods, actor-critic architectures, and other core RL concepts.