Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, 2017 (arXiv preprint arXiv:1707.06347), DOI: 10.48550/arXiv.1707.06347 - Explains the core PPO algorithm, including the role of the old policy and value function in stabilizing training.
Distributed Communication Package - torch.distributed, PyTorch Authors, 2024 (PyTorch) - Provides details on implementing distributed training in PyTorch, covering gradient aggregation and process synchronization.