Supervised Fine-Tuning (SFT) provides a baseline, but achieving alignment requires a more direct way to incorporate human judgment about response quality. This chapter shifts focus to building a Reward Model (RM), a critical component that learns to predict which AI-generated responses humans prefer.
You will learn how to build this RM, starting with the concept of learning directly from pairwise comparisons. We will cover methods for collecting human preference data, structuring these datasets, and selecting appropriate model architectures. The chapter details common training objectives, such as those based on the Bradley-Terry model, formulated as:
$$P(\text{response}_1 \succ \text{response}_2 \mid \text{prompt}) = \sigma\big(RM(\text{prompt}, \text{response}_1) - RM(\text{prompt}, \text{response}_2)\big)$$
where σ is the sigmoid function and ≻ denotes preference.
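To make this objective concrete, here is a minimal sketch of the corresponding negative log-likelihood loss in PyTorch. It assumes the reward model has already produced scalar scores for the preferred and rejected responses in a batch; the function name and the dummy tensors are illustrative placeholders, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry preference model.

    chosen_rewards / rejected_rewards: shape (batch,), the scalar scores
    RM(prompt, response) for the preferred and rejected responses.
    """
    # P(chosen ≻ rejected) = sigmoid(r_chosen - r_rejected), so the loss is
    # -log sigmoid(r_chosen - r_rejected); logsigmoid keeps this numerically stable.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example usage with dummy scores standing in for reward model outputs:
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.8, 1.5])
print(bradley_terry_loss(chosen, rejected).item())
```

Minimizing this loss pushes the score of the preferred response above that of the rejected one, which is exactly the objective explored in Section 3.5.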
Furthermore, we will examine techniques for calibrating reward model scores so they better reflect preference strength, and discuss challenges that arise during reward modeling, including data quality issues and the risk of the model exploiting unintended shortcuts (reward hacking). By the end of this chapter, you will understand how to train a model that quantifies human preferences, setting the stage for reinforcement learning optimization.
3.1 Concept of Learning from Preferences
3.2 Human Preference Data Collection
3.3 Preference Dataset Formats and Structures
3.4 Reward Model Architectures
3.5 Training Objectives for Reward Models
3.6 Calibration of Reward Models
3.7 Potential Issues in Reward Modeling
3.8 Hands-on Practical: Training a Reward Model