Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017. arXiv preprint arXiv:1707.06347. DOI: 10.48550/arXiv.1707.06347 - The original paper introducing the Proximal Policy Optimization (PPO) algorithm, detailing its clipped surrogate objective, actor-critic structure, and method for policy updates.
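As an illustration of the clipped surrogate objective this paper introduces, here is a minimal sketch in PyTorch; the tensor names and the default clip range of 0.2 are illustrative choices, not something prescribed by this reading list.

```python
# Minimal sketch of PPO's clipped surrogate objective (Schulman et al., 2017).
# logprobs / old_logprobs are per-action log-probabilities under the current
# and behavior policies; advantages are the (e.g. GAE) advantage estimates.
import torch

def ppo_clipped_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    ratio = torch.exp(logprobs - old_logprobs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The paper maximizes E[min(unclipped, clipped)]; return the negative
    # so it can be minimized with a standard optimizer.
    return -torch.mean(torch.min(unclipped, clipped))
```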
High-Dimensional Continuous Control Using Generalized Advantage Estimation, John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel, 2015. arXiv preprint arXiv:1506.02438. DOI: 10.48550/arXiv.1506.02438 - This paper introduces Generalized Advantage Estimation (GAE), a widely used technique for reducing variance in policy gradient methods, which is integral to the value network's function in PPO.
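To make the GAE recursion from this paper concrete, the sketch below accumulates the temporal-difference residuals delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) with weight gamma * lambda; it assumes a single non-terminated trajectory and a bootstrap value appended to `values`, which are simplifying assumptions for illustration.

```python
# Minimal sketch of Generalized Advantage Estimation (Schulman et al., 2015).
# rewards: list of length T; values: list of length T + 1 (includes the
# bootstrap value V(s_T)). Returns one advantage estimate per timestep.
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```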
Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. Advances in Neural Information Processing Systems (NeurIPS) 35. DOI: 10.48550/arXiv.2203.02155 - A significant paper that applies Reinforcement Learning from Human Feedback (RLHF), using PPO, to fine-tune large language models, showing practical considerations for policy and value networks.
TRL Documentation: PPO Trainer, Hugging Face, 2024 - Official documentation for the Hugging Face TRL library's PPO Trainer, offering practical guidance and code examples for implementing PPO with an AutoModelForCausalLMWithValueHead.
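The snippet below is a minimal sketch of loading a causal language model with a value head via TRL, the model class the PPO Trainer documentation centers on; the checkpoint name is a placeholder and the forward-pass return signature may differ across TRL versions, so treat it as an assumption rather than the library's canonical example.

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model_name = "gpt2"  # placeholder checkpoint; any causal LM should work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

inputs = tokenizer("PPO needs per-token logits and value estimates:",
                   return_tensors="pt")
# In recent TRL releases the forward pass returns (lm_logits, loss, values);
# the value head maps each hidden state to a scalar value estimate.
lm_logits, _, values = model(**inputs)
print(lm_logits.shape)  # (batch, seq_len, vocab_size)
print(values.shape)     # (batch, seq_len)
```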