Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017, arXiv preprint arXiv:1707.06347, DOI: 10.48550/arXiv.1707.06347 - Introduces the Proximal Policy Optimization (PPO) algorithm, which forms the foundation for the PPO fine-tuning approach described in the section.
TRL - Hugging Face Documentation, Hugging Face, 2024 - The official documentation for the TRL library, providing details on its architecture, components like PPOConfig and PPOTrainer, and usage examples.
High-Dimensional Continuous Control Using Generalized Advantage Estimation, John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel, 2015, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1506.02438 - Presents Generalized Advantage Estimation (GAE), a technique for computing advantage estimates in policy gradient methods such as PPO, explicitly mentioned in the section.