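Lecture 6: Policy Gradient, David Silver, 2015. UCL Course on Reinforcement Learning (University College London) - David Silver's lecture series has been highly influential. Lecture 6 is devoted to policy gradient methods, the policy gradient theorem, and the REINFORCE algorithm, often presented from a modern perspective relevant to deep reinforcement learning.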
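Asynchronous Methods for Deep Reinforcement Learning, Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, 2016. Proceedings of The 33rd International Conference on Machine Learning, Vol. 48 (PMLR). DOI: 10.48550/arXiv.1602.01783 - Although this paper introduces the A3C algorithm, its key contribution here is the discussion of the advantage function as a means of reducing variance in policy gradients, which relates directly to the shortcomings noted for the REINFORCE algorithm.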