Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017. arXiv preprint arXiv:1707.06347. DOI: 10.48550/arXiv.1707.06347 - Introduces Proximal Policy Optimization (PPO), the core reinforcement learning algorithm widely used in RLHF, explaining the role of KL divergence in policy updates.