Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017 (arXiv preprint arXiv:1707.06347), DOI: 10.48550/arXiv.1707.06347 - The original paper introducing the Proximal Policy Optimization (PPO) algorithm, which is the core method for RL fine-tuning.
PPO with 🤗 TRL, Hugging Face, 2024 - Official documentation for using Proximal Policy Optimization (PPO) with the Hugging Face TRL library, providing practical implementation details for RLHF.