As discussed earlier in this chapter, Direct Preference Optimization (DPO) offers a compelling alternative to the multi-stage RLHF pipeline. Instead of training a separate reward model, DPO directly optimizes the language model policy using preference data. This section provides a hands-on guide to understanding and implementing the DPO loss function, a core component of this technique.

## The DPO Objective Revisited

Recall that DPO aims to increase the relative log probability of preferred responses ($y_w$) compared to rejected responses ($y_l$) for a given prompt ($x$). It achieves this by implicitly defining a reward function based on the ratio between the policy model's ($\pi_\theta$) probability and a reference model's ($\pi_{ref}$) probability for a given completion. The objective function derived from this formulation is:

$$ L_{DPO}(\pi_\theta; \pi_{ref}) = - \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right] $$

Let's break down the terms:

- $D$: The dataset of preference triplets $(x, y_w, y_l)$.
- $\pi_\theta$: The language model being fine-tuned (the policy).
- $\pi_{ref}$: A fixed reference language model (often the SFT model from which DPO training starts).
- $\beta$: A hyperparameter that trades off fitting the preference data against staying close to the reference model. Higher $\beta$ means stronger adherence to the reference.
- $\log \pi(y|x)$: The sum of the log probabilities of the tokens in sequence $y$ given prompt $x$.
- $\sigma$: The logistic sigmoid function, $\sigma(z) = 1 / (1 + e^{-z})$.
- $\log \sigma(\cdot)$: The log-sigmoid function. The loss aims to maximize its argument, which represents the scaled difference in implicit rewards between the chosen and rejected responses. Maximizing $\log \sigma(z)$ is equivalent to minimizing $-\log \sigma(z)$.

The term inside the $\log \sigma$ function can be rewritten as:

$$ \beta \left( (\log \pi_\theta(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{ref}(y_l|x)) \right) $$

This highlights that we are maximizing the difference between the log-probability ratio of the chosen response and that of the rejected response, scaled by $\beta$.
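To make the scale of this objective concrete, here is a small worked example for a single triplet with $\beta = 0.1$; the log-probability values are made up purely for illustration:

$$
\begin{aligned}
&\log \pi_\theta(y_w|x) = -10.0, \quad \log \pi_{ref}(y_w|x) = -10.4, \quad \log \pi_\theta(y_l|x) = -12.0, \quad \log \pi_{ref}(y_l|x) = -11.6 \\
&z = \beta \big( (-10.0 - (-10.4)) - (-12.0 - (-11.6)) \big) = 0.1 \times \big( 0.4 - (-0.4) \big) = 0.08 \\
&L_{DPO} = -\log \sigma(0.08) \approx 0.654
\end{aligned}
$$

When the margin $z$ is zero (as at the start of training, where $\pi_\theta = \pi_{ref}$), the per-sample loss equals $\log 2 \approx 0.693$; it decreases toward zero as the policy assigns relatively higher probability to the chosen response than to the rejected one.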
## Implementing the DPO Loss Calculation

To implement this loss in a typical deep learning framework such as PyTorch or TensorFlow, you need the log probabilities of the chosen ($y_w$) and rejected ($y_l$) sequences under both the policy model ($\pi_\theta$) being trained and the frozen reference model ($\pi_{ref}$).

Here are the steps to compute the loss for a batch of preference data:

1. Obtain Log Probabilities: Perform forward passes for the prompts ($x$) and both completions ($y_w$, $y_l$) through both the policy model and the reference model (a sketch of this computation follows the code example below). This yields four sets of log probabilities for each sample in the batch:
   - `policy_chosen_logps`: $\log \pi_\theta(y_w|x)$
   - `policy_rejected_logps`: $\log \pi_\theta(y_l|x)$
   - `ref_chosen_logps`: $\log \pi_{ref}(y_w|x)$
   - `ref_rejected_logps`: $\log \pi_{ref}(y_l|x)$

   Remember that the reference model $\pi_{ref}$ is not updated during training; its parameters remain fixed, and no gradients are computed for it. The policy model $\pi_\theta$ is the one whose parameters are being optimized.
2. Calculate Log Ratios: Compute the log-probability ratios relative to the reference model for both the chosen and rejected responses: `log_ratio_w = policy_chosen_logps - ref_chosen_logps` and `log_ratio_l = policy_rejected_logps - ref_rejected_logps`.
3. Calculate the Difference: Take the difference between these log ratios and scale it by $\beta$: `diff = beta * (log_ratio_w - log_ratio_l)`.
4. Apply the Logistic Loss: Compute the negative log-sigmoid of the difference; this is the core DPO loss for each sample. Using standard library functions such as `logsigmoid` helps maintain numerical stability: `loss_per_sample = -torch.nn.functional.logsigmoid(diff)` in PyTorch, or the equivalent. Note that $-\log \sigma(z)$ is equal to $\text{softplus}(-z)$.
5. Average the Loss: Compute the mean of `loss_per_sample` across the batch to get the final loss value for the training step.

## Code Example (PyTorch)

Below is a Python function using PyTorch that demonstrates the DPO loss calculation, assuming you have already computed the necessary log probabilities.

```python
import torch
import torch.nn.functional as F

def compute_dpo_loss(policy_chosen_logps: torch.Tensor,
                     policy_rejected_logps: torch.Tensor,
                     ref_chosen_logps: torch.Tensor,
                     ref_rejected_logps: torch.Tensor,
                     beta: float) -> torch.Tensor:
    """
    Computes the Direct Preference Optimization (DPO) loss.

    Args:
        policy_chosen_logps: Log probabilities of the chosen responses under
            the policy model. Shape: (batch_size,)
        policy_rejected_logps: Log probabilities of the rejected responses under
            the policy model. Shape: (batch_size,)
        ref_chosen_logps: Log probabilities of the chosen responses under
            the reference model. Shape: (batch_size,)
        ref_rejected_logps: Log probabilities of the rejected responses under
            the reference model. Shape: (batch_size,)
        beta: Temperature parameter controlling the deviation from the
            reference model.

    Returns:
        The average DPO loss over the batch.
    """
    # Log ratios of the policy relative to the reference model (pi_ref)
    log_ratio_chosen = policy_chosen_logps - ref_chosen_logps
    log_ratio_rejected = policy_rejected_logps - ref_rejected_logps

    # Difference of log ratios, scaled by beta.
    # This is beta * (reward_chosen - reward_rejected), where the reward is
    # implicitly defined via log(pi_policy / pi_ref).
    diff = beta * (log_ratio_chosen - log_ratio_rejected)

    # Per-sample loss: -log(sigmoid(diff)) = softplus(-diff).
    # F.logsigmoid is used for numerical stability.
    loss = -F.logsigmoid(diff)

    # Average the loss over the batch
    average_loss = loss.mean()
    return average_loss


# --- Usage ---
# Assume these tensors come from forward passes of your models
# (e.g., from the logits returned by model(input_ids, labels=labels)).
# Log probabilities are summed over the response tokens of each sequence.
batch_size = 8

# Example log probabilities (ensure they are properly calculated in practice)
policy_chosen_logps = torch.tensor([-10.5, -12.1, -9.8, -11.0, -13.5, -10.1, -11.8, -12.5], requires_grad=True)
policy_rejected_logps = torch.tensor([-11.2, -11.9, -10.5, -11.5, -12.8, -10.9, -12.3, -13.0], requires_grad=True)
ref_chosen_logps = torch.tensor([-10.2, -11.8, -9.5, -10.7, -13.0, -9.8, -11.5, -12.1])      # no gradients needed
ref_rejected_logps = torch.tensor([-11.0, -11.5, -10.1, -11.1, -12.2, -10.5, -11.9, -12.5])  # no gradients needed

beta_value = 0.1

# Compute the DPO loss
dpo_loss_value = compute_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                  ref_chosen_logps, ref_rejected_logps, beta_value)
print(f"Computed DPO Loss: {dpo_loss_value.item():.4f}")

# --- In a training loop ---
# optimizer.zero_grad()
# dpo_loss_value.backward()  # compute gradients for the policy model parameters
# optimizer.step()
```
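The example above assumes the summed sequence log probabilities are already available. As a rough sketch of how they might be obtained from a causal language model (the helper name `compute_sequence_logps` and the `-100` masking convention are assumptions for illustration, not a fixed API), one teacher-forced forward pass per completion provides the logits, from which the response-token log probabilities are gathered and summed:

```python
import torch
import torch.nn.functional as F

def compute_sequence_logps(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum the log probabilities of the response tokens for each sequence.

    Args:
        logits: Model output logits, shape (batch_size, seq_len, vocab_size).
        labels: Target token ids, shape (batch_size, seq_len), with prompt and
            padding positions set to -100 so they are excluded from the sum.

    Returns:
        Tensor of shape (batch_size,) with the summed response log probabilities.
    """
    # Shift so the logits at position t are scored against the token at t+1
    logits = logits[:, :-1, :]
    labels = labels[:, 1:].clone()

    # Boolean mask of positions that belong to the response
    response_mask = labels != -100
    labels[~response_mask] = 0  # placeholder index so gather() is valid; masked out below

    # Log probability the model assigns to each target token
    per_token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)

    # Sum only over response tokens
    return (per_token_logps * response_mask).sum(dim=-1)
```

Calling such a helper once per model and per completion (policy and reference, chosen and rejected) yields the four tensors that `compute_dpo_loss` expects.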
## Practical Notes

- Reference Model Management: Ensure the reference model's weights are frozen and that it is set to evaluation mode (e.g., `ref_model.eval()` in PyTorch) to disable dropout and other training-specific behaviors during the forward pass used for log probability calculation.
- Log Probability Calculation: Correctly calculating $\log \pi(y|x)$ involves a teacher-forced forward pass with the target sequence $y$ as labels, summing the log probabilities of the actual tokens in $y$. Frameworks like Hugging Face's transformers provide the logits needed to compute sequence likelihoods (see the sketch above).
- The $\beta$ Hyperparameter: Tuning $\beta$ is important. A common starting point is around 0.1. Experimentation is needed to find a value that effectively incorporates preferences without letting the policy model drift too far from the reference model's fluency and knowledge, which could degrade generation quality or performance on other tasks.
- Numerical Stability: Working in log space, as shown, is essential. Library functions like `F.logsigmoid` avoid the underflow/overflow issues that can arise from computing `log(1 / (1 + exp(-diff)))` directly.
- Integration into Training: This loss function replaces the standard cross-entropy loss used during supervised fine-tuning. The training loop involves sampling batches of $(x, y_w, y_l)$, performing the four forward passes (two with $\pi_\theta$, two with $\pi_{ref}$), calculating the DPO loss, and backpropagating the gradients only through the policy model $\pi_\theta$.

## Exercise

Consider adapting a standard language model fine-tuning script (perhaps one using Hugging Face transformers). Modify the training loop to incorporate the DPO loss calculation; a skeletal training step is sketched below. You will need:

- A dataset formatted as preference triplets.
- Two instances of your base language model: one for the policy (`policy_model`) and one for the reference (`ref_model`). Ensure `ref_model` is frozen.
- A function to compute the sequence log probabilities of the chosen and rejected responses under both models.
- The `compute_dpo_loss` function (or similar) integrated into your training step.

Libraries like Hugging Face's trl offer a pre-built `DPOTrainer` class that abstracts away much of this complexity, but implementing the core loss yourself provides a deeper understanding of the underlying mechanics.
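For the exercise, a minimal sketch of a single training step might look like the following. It assumes the `compute_dpo_loss` function and the `compute_sequence_logps` helper shown earlier, Hugging Face-style models that return an output with `.logits`, and a `batch` dictionary whose key names are placeholders; attention masks, device placement, and data loading are omitted for brevity.

```python
import torch

def dpo_training_step(policy_model, ref_model, optimizer, batch, beta=0.1):
    """One illustrative DPO update step.

    `batch` is assumed to hold tokenized chosen/rejected sequences plus label
    tensors with prompt and padding positions set to -100; the key names below
    are placeholders, not a fixed schema.
    """
    # Policy forward passes (gradients required)
    policy_chosen_logps = compute_sequence_logps(
        policy_model(batch["chosen_input_ids"]).logits, batch["chosen_labels"])
    policy_rejected_logps = compute_sequence_logps(
        policy_model(batch["rejected_input_ids"]).logits, batch["rejected_labels"])

    # Reference forward passes (frozen model, no gradients)
    with torch.no_grad():
        ref_chosen_logps = compute_sequence_logps(
            ref_model(batch["chosen_input_ids"]).logits, batch["chosen_labels"])
        ref_rejected_logps = compute_sequence_logps(
            ref_model(batch["rejected_input_ids"]).logits, batch["rejected_labels"])

    loss = compute_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps, beta)

    optimizer.zero_grad()
    loss.backward()   # gradients flow only through the policy model
    optimizer.step()
    return loss.item()
```

The same structure underlies library implementations such as trl's `DPOTrainer`; writing it out once makes clear which tensors require gradients and which come from the frozen reference model.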