Deep Reinforcement Learning from Human Preferences, Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, Dario Amodei, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.) - A foundational work that introduces a method for training reinforcement learning agents directly from human preference comparisons, laying the groundwork for reward modeling.