The Role of the KL Divergence Penalty

References

Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. arXiv preprint arXiv:2203.02155. DOI: 10.48550/arXiv.2203.02155 - Introduces the InstructGPT models and describes in detail an RLHF pipeline that uses a KL divergence penalty while aligning language models with human preferences.
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017. arXiv preprint arXiv:1707.06347. DOI: 10.48550/arXiv.1707.06347 - Proposes the Proximal Policy Optimization (PPO) algorithm, which serves as the core policy-update component in RLHF pipelines.
Learning to summarize from human feedback, Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, 2020. NeurIPS 2020. DOI: 10.48550/arXiv.2009.01325 - An early and influential paper that demonstrates RLHF applied to text summarization, explicitly using a KL divergence term as a penalty to preserve text quality.