Learning to summarize with human feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2009.01325 - An early OpenAI work showing how to train a reward model from human feedback, specifically for the summarization task, using a similar pairwise preference learning objective.