This chapter establishes the fundamental concepts necessary for understanding Large Language Model (LLM) alignment. Before tackling advanced techniques, we need a solid grasp of the basics.
We start by providing a working definition of alignment specifically for LLMs and outlining the core objectives and difficulties associated with the alignment problem. We then review how instruction following and standard fine-tuning serve as foundational steps toward aligned behavior, before introducing initial approaches to measuring alignment and discussing why these methods often fall short.
The chapter also differentiates between the concepts of inner and outer alignment. Understanding this distinction is helpful for diagnosing failure modes. Finally, we examine common problems like specification gaming, where a model optimizes a proxy objective $R_{\text{proxy}}$ instead of the intended objective $R_{\text{intended}}$, and reward hacking, illustrating how models can exploit poorly defined goals. These foundational ideas prepare you for the more complex methods discussed later in the course.
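To make this concrete, one way to express specification gaming (the gap notation below is illustrative, not the chapter's own) is to compare what the optimizer actually maximizes with what we wanted it to maximize:

$$
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[R_{\text{proxy}}(x)\big],
\qquad
\text{gap}(\pi^{*}) \;=\; \max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[R_{\text{intended}}(x)\big] \;-\; \mathbb{E}_{x \sim \pi^{*}}\big[R_{\text{intended}}(x)\big].
$$

A large gap means the model scores well on the proxy while doing poorly on the objective we actually care about, which is exactly the behavior that specification gaming and reward hacking describe.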
1.1 Defining Alignment in Large Language Models
1.2 The Alignment Problem: Objectives and Challenges
1.3 Instruction Following and Fine-tuning Review
1.4 Measuring Alignment: Initial Metrics and Limitations
1.5 The Concept of Inner and Outer Alignment
1.6 Specification Gaming and Reward Hacking