Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 2022. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.2203.02155 - Introduces the InstructGPT models and applies the Reinforcement Learning from Human Feedback (RLHF) paradigm to instruction following, in which a reward model is trained to approximate human preferences, while acknowledging that the reward model is an imperfect proxy for those preferences.
Concrete Problems in AI Safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, 2016. arXiv preprint arXiv:1606.06565. DOI: 10.48550/arXiv.1606.06565 - A foundational paper that identifies and defines several core challenges in AI safety, including reward hacking (also known as specification gaming), which occurs when an AI optimizes a flawed objective function.