Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022arXiv preprint arXiv:2203.02155DOI: 10.48550/arXiv.2203.02155 - 这篇论文详细介绍了使用人类反馈进行监督微调和强化学习的过程,为校准语言模型中的常见挑战提供了背景。