Once you have successfully trained an AI preference model, $p_\theta(y_w \succ y_l \mid x)$, capable of discerning preferred responses based on AI-generated labels, the next significant step is translating these pairwise preferences into a usable scalar reward signal, $r(x, y)$. This reward function is the foundation for the subsequent reinforcement learning phase, guiding the LLM policy towards generating outputs that align with the learned preferences.
The fundamental assumption, inherited from RLHF, is that the preference model $p_\theta$ approximates an underlying latent reward function $r_\phi(x, y)$. A common modeling choice is the Bradley-Terry model, which posits that the probability of preferring response $y_w$ over $y_l$ given prompt $x$ can be expressed as:

$$p(y_w \succ y_l \mid x) = \frac{\exp(r_\phi(x, y_w))}{\exp(r_\phi(x, y_w)) + \exp(r_\phi(x, y_l))} = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Here, $\sigma(\cdot)$ is the sigmoid function. During preference model training, we fit $p_\theta$ to the AI-generated preference data $(x, y_w, y_l)$, effectively learning an approximation of the difference in rewards between pairs of responses.
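To make this concrete, here is a minimal PyTorch sketch (the score tensors are random stand-ins and the function name is illustrative) of how the Bradley-Terry relationship turns a pair of scalar rewards into a preference probability, along with the pairwise negative log-likelihood loss commonly used to fit the preference model:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_w: torch.Tensor, score_l: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for a batch of (winner, loser) scalar scores.

    score_w, score_l: shape (batch,), playing the roles of r_phi(x, y_w) and r_phi(x, y_l).
    Minimizes -log p(y_w > y_l | x) with p = sigmoid(score_w - score_l).
    """
    # logsigmoid is the numerically stable form of log(sigmoid(.))
    return -F.logsigmoid(score_w - score_l).mean()

# Random scores standing in for preference-model outputs on a batch of pairs
score_w = torch.randn(8)
score_l = torch.randn(8)
pref_prob = torch.sigmoid(score_w - score_l)   # p(y_w > y_l | x) per pair
loss = bradley_terry_loss(score_w, score_l)    # objective minimized during training
```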
Our goal now is to extract a scalar reward function $r(x, y)$ that reflects the quality of a single response $y$ given prompt $x$, based on the trained preference model $p_\theta$. Since $p_\theta$ models the difference in rewards, the absolute scale of the extracted reward function $r(x, y)$ is somewhat arbitrary, but its relative values should reflect the preferences learned by $p_\theta$.
A prevalent method to define the reward signal $r(x, y)$ involves leveraging the internal computations of the trained preference model $p_\theta$. Many preference model architectures compute an internal scalar score, call it $s_\theta(x, y)$, for each input response $y$. The preference probability is then calculated from the difference between the scores of the winning and losing responses:

$$p_\theta(y_w \succ y_l \mid x) = \sigma\big(s_\theta(x, y_w) - s_\theta(x, y_l)\big)$$

Given this structure, a natural choice for the reward function is to use this internal score directly:

$$r(x, y) = s_\theta(x, y)$$

This score $s_\theta(x, y)$ represents the learned quality of response $y$ according to the AI preference labeler. It directly captures the information optimized during preference model training.
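As an illustration of this setup, the sketch below assumes a Hugging Face `transformers` decoder backbone (GPT-2 purely for convenience) with a scalar head over the last non-padding hidden state; the class name and pooling choice are assumptions for this example, not a prescribed architecture:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PreferenceRewardModel(nn.Module):
    """Transformer backbone plus a scalar head producing s_theta(x, y)."""

    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Summarize the sequence with the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        summary = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(summary).squeeze(-1)   # shape (batch,): s_theta(x, y)

# Using the internal score directly as the reward: r(x, y) = s_theta(x, y)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = PreferenceRewardModel("gpt2")
batch = tokenizer(["Prompt text\nResponse text"], return_tensors="pt", padding=True)
with torch.no_grad():
    reward = reward_model(batch["input_ids"], batch["attention_mask"])
```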
Alternative Formulation:
In some implementations, the reward function might be defined using the logarithm of the sigmoid applied to the score:
$$r(x, y) = \beta \log\big(\sigma(s_\theta(x, y))\big)$$

The scaling factor $\beta$ is often introduced as a hyperparameter to control the magnitude of the rewards, which can significantly impact the RL training dynamics. This formulation emphasizes deviations from a neutral score (often zero).
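The transformation itself is a one-liner. In the sketch below, $\beta$ and the score values are arbitrary illustrative choices; note how the log-sigmoid compresses large positive scores toward zero while penalizing negative scores increasingly sharply:

```python
import torch
import torch.nn.functional as F

beta = 0.1                                            # illustrative scale hyperparameter
scores = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])    # hypothetical s_theta(x, y) values
rewards = beta * F.logsigmoid(scores)                 # r(x, y) = beta * log(sigmoid(s))
# rewards ≈ [-0.402, -0.131, -0.069, -0.031, -0.002]
```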
Simply extracting a score is often insufficient for stable and effective RL training. Several practical considerations arise:
Reward Scaling and Normalization: The raw scores $s_\theta(x, y)$ might have an arbitrary scale or offset. Large reward values can lead to excessively large policy updates and instability in algorithms like PPO. Conversely, very small rewards might result in slow learning. Common techniques include standardizing rewards with a running mean and standard deviation and clipping them to a fixed range; one such approach is sketched in the code example after these considerations.
KL Divergence Penalty: To prevent the RL policy $\pi_{RL}$ from deviating too drastically from the original supervised fine-tuned (SFT) policy $\pi_{SFT}$ (or the policy trained after the CAI supervised phase), PPO typically incorporates a KL divergence penalty. The final reward signal used in the PPO update often takes the form:

$$r_{\text{final}}(x, y) = r(x, y) - \beta_{KL}\, \mathrm{KL}\big(\pi_{RL}(\cdot \mid x)\,\|\,\pi_{SFT}(\cdot \mid x)\big)$$

Here, $r(x, y)$ is the reward derived from the AI preference model, and $\beta_{KL}$ controls the strength of the penalty for deviating from the reference policy. Designing the reward function $r(x, y)$ must consider its interaction with this KL term. The scale of $r(x, y)$ relative to the KL penalty is a sensitive hyperparameter.
Reward Model Drift: If the AI preference model itself is updated or changed during the RL training process (less common, but possible in advanced setups), the reward function becomes non-stationary. This can complicate RL convergence and requires careful handling.
Bias Amplification: The AI preference model might inherit or even amplify biases present in the model used to generate the labels (e.g., the constitution-based critiquer or another LLM). The resulting reward function will encode these biases. Keep in mind that the reward signal reflects what the AI labeler judges to be good, which might not perfectly align with the intended constitutional principles or broader safety goals.
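One common way to combine these pieces, assuming a single sequence-level reward per response, is sketched below; the running-statistics normalizer and the $\beta_{KL}$ value are illustrative choices rather than a canonical recipe:

```python
import torch

class RunningNorm:
    """Tracks a running mean/std to standardize raw preference-model scores."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, x: torch.Tensor):
        # Welford's online algorithm over each new batch of scores
        for v in x.flatten().tolist():
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        return (x - self.mean) / (std + self.eps)

def final_reward(raw_score, policy_logprobs, ref_logprobs, norm, beta_kl=0.05):
    """r_final(x, y) = normalized r(x, y) - beta_KL * sequence-level KL estimate.

    raw_score:       (batch,) scores s_theta(x, y) from the frozen preference model
    policy_logprobs: (batch,) sum of log pi_RL(y_t | x, y_<t) over response tokens
    ref_logprobs:    (batch,) the same quantity under the frozen reference policy
    """
    norm.update(raw_score)
    reward = norm.normalize(raw_score)
    kl_estimate = policy_logprobs - ref_logprobs   # simple per-sequence KL estimate
    return reward - beta_kl * kl_estimate
```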
The designed reward function $r(x, y)$ serves as the primary signal guiding the optimization of the LLM policy $\pi_{RL}$ using PPO (or another suitable RL algorithm). During the PPO rollout phase, for each prompt $x$ sampled from a dataset, the current policy $\pi_{RL}$ generates a response $y$. This response is then fed into the frozen, trained AI preference model to compute the reward $r(x, y)$. This reward, potentially combined with the KL penalty, is used to calculate the advantages and update the policy parameters.
Flow diagram illustrating the generation and use of the reward signal within the RLAIF PPO loop. The frozen AI preference model computes a score for the generated response, which is then transformed into the reward signal used for the policy update.
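Building on the helpers sketched above, the reward computation inside the rollout loop might look like the following; every callable argument here is a hypothetical placeholder for the corresponding component of a real PPO implementation:

```python
import torch

def rlaif_rollout_rewards(prompts, policy_generate, score_fn,
                          policy_logprob_fn, ref_logprob_fn, beta_kl=0.05):
    """Generate responses with the current policy, score them with the frozen
    AI preference model, and apply the KL penalty before the PPO update."""
    responses = [policy_generate(x) for x in prompts]                 # y ~ pi_RL(. | x)
    raw = torch.stack([score_fn(x, y)                                 # s_theta(x, y)
                       for x, y in zip(prompts, responses)])
    kl = policy_logprob_fn(prompts, responses) - ref_logprob_fn(prompts, responses)
    rewards = raw - beta_kl * kl                                      # r_final(x, y)
    return responses, rewards      # passed to advantage estimation and the policy update
```

The raw scores could also be passed through the normalizer from the previous sketch before the KL penalty is applied.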
Designing an effective reward function from AI preferences requires care. It demands not only an understanding of the theoretical link between preference probabilities and scalar rewards, but also attention to scaling, normalization, and integration with the chosen RL algorithm's objective function to ensure stable and efficient policy optimization. The quality of the reward function depends directly on the quality and consistency of the upstream AI preference labeler and the preference model trained on its outputs.