Supervised Fine-Tuning (SFT) adapts models to follow instructions, but achieving alignment with nuanced human values like helpfulness, honesty, and harmlessness often requires additional steps. This chapter introduces Reinforcement Learning from Human Feedback (RLHF), a technique for fine-tuning language models using human judgments about output quality.
We will examine the standard RLHF process, beginning with the collection of pairwise preference data. You will learn to train a reward model (RM) that predicts which of two model generations humans are likely to prefer. We then show how the RM's scores serve as the reward signal within a reinforcement learning framework, typically Proximal Policy Optimization (PPO), to adjust the SFT model's behavior. We also explain the role of the Kullback-Leibler (KL) divergence penalty in stabilizing the RL process and briefly survey alternative methods such as Direct Preference Optimization (DPO). By the end of this chapter, you will understand the components and implementation details of the RLHF pipeline for aligning large language models.
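To make these two ingredients concrete, here is a minimal sketch in PyTorch of (a) the pairwise Bradley-Terry-style loss commonly used to train the reward model and (b) a KL-penalized reward of the kind PPO optimizes against. Function names, argument shapes, and the kl_coef value are illustrative assumptions, not a prescribed implementation; the chapter's later sections develop the full pipeline.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss (Bradley-Terry style), an illustrative sketch.

    r_chosen and r_rejected are scalar RM scores for the preferred and
    dispreferred responses to the same prompt.
    """
    # Maximize the log-probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def penalized_reward(rm_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     sft_logprobs: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Reward signal with a KL penalty, an illustrative sketch.

    rm_score is the reward model's score for a sampled response;
    policy_logprobs and sft_logprobs are the log-probabilities the current
    policy and the frozen SFT model assign to the response's tokens.
    """
    # Approximate the per-token KL term as the log-probability gap between
    # the policy being trained and the original SFT model.
    kl = policy_logprobs - sft_logprobs
    return rm_score - kl_coef * kl
```

The kl_coef term controls the trade-off discussed later in the chapter: larger values keep the policy closer to the SFT model, while smaller values let PPO chase higher RM scores at the risk of drifting toward degenerate outputs.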
26.1 The RLHF Pipeline Overview
26.2 Collecting Human Preference Data
26.3 Training the Reward Model (RM)
26.4 Introduction to Proximal Policy Optimization (PPO)
26.5 RL Fine-tuning with PPO
26.6 The Role of the KL Divergence Penalty
26.7 Challenges and Considerations in RLHF
26.8 Alternatives: Direct Preference Optimization (DPO)