Trust Region Policy Optimization, John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel, 2015. Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol. 37 (Proceedings of Machine Learning Research, PMLR) - This foundational paper introduces Trust Region Policy Optimization (TRPO), detailing its theoretical basis, its constrained optimization formulation, and its practical approximations.
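A Natural Policy Gradient, Sham M. Kakade, 2001. Advances in Neural Information Processing Systems, Vol. 14 (NeurIPS) - This paper lays the theoretical groundwork for the natural policy gradient, which uses the Fisher information matrix, and influenced TRPO's constrained optimization approach.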