Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2203.02155 - Introduces the InstructGPT models and the reinforcement learning from human feedback (RLHF) paradigm, in which a reward model is trained to approximate human preferences, while acknowledging that this reward model is an imperfect proxy for those preferences (a minimal sketch of the reward-model loss follows this list).
Concrete Problems in AI Safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, 2016. arXiv preprint arXiv:1606.06565. DOI: 10.48550/arXiv.1606.06565 - A foundational paper that identifies and defines several core challenges in AI safety, including "reward hacking" (specification gaming), which arises when an AI optimizes a flawed objective function.
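
To make the reward-modeling idea in the Ouyang et al. entry concrete, here is a minimal sketch of the pairwise (Bradley-Terry) preference loss that RLHF reward models are trained with: minimize the negative log-probability that the human-preferred response scores above the rejected one. The PyTorch framing and all variable names are illustrative assumptions, not code from either paper.

import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss: -E[log sigmoid(r(x, y_chosen) - r(x, y_rejected))].
    # Minimizing it pushes the reward model to assign a higher scalar score
    # to the human-preferred response than to the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical usage: scalar rewards produced by a reward model for a
# batch of (preferred, rejected) response pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_model_loss(r_chosen, r_rejected))  # prints a scalar loss

Because the learned reward is only a proxy for true human preferences, a policy optimized against it can exploit its flaws, which is exactly the reward-hacking failure mode the Amodei et al. entry describes.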