As we move from building aligned models to assessing them, we need concrete criteria to judge their behavior. Simply stating a model is "aligned" or "safe" is insufficient without defining what those terms mean in practice. The goal is to operationalize these high-level concepts into measurable dimensions that reflect desired performance and undesirable failure modes.
A widely adopted framework for structuring this evaluation, particularly emphasized by Anthropic and influential across the field, revolves around three core principles: Helpfulness, Honesty, and Harmlessness (often referred to as HHH). While other framings exist, these three dimensions provide a robust starting point for dissecting and evaluating the complex behavior of LLMs. They serve as guideposts for both alignment techniques during training and evaluation protocols post-deployment.
Let's examine each dimension in detail:
Helpfulness relates directly to the model's ability to understand and successfully fulfill a user's request or implicit intent. An ideal helpful model should:
Achieving helpfulness isn't always straightforward. It requires the model to interpret ambiguity, make reasonable assumptions, and sometimes even clarify the user's intent. Overly prioritizing helpfulness without considering other dimensions can lead to issues; for instance, a model might try to be helpful by answering a question it doesn't actually know the answer to (violating honesty) or by fulfilling a harmful request (violating harmlessness).
Honesty in the context of LLMs primarily concerns factual accuracy and truthful representation. It's not about moral intent but about the reliability of the information provided. Key aspects include:
Evaluating honesty often involves comparing model outputs against ground-truth data sources or using benchmarks specifically designed to test factual recall and resistance to generating misinformation (like TruthfulQA, which we'll discuss later). A critical challenge is distinguishing between genuine knowledge gaps and subtle fabrications woven into otherwise correct statements.
Harmlessness focuses on preventing the model from generating outputs that could cause harm to individuals or groups. This is a broad category encompassing various types of undesirable content:
Defining harmlessness precisely can be complex and context-dependent. Societal norms differ, and the line between expressing an opinion and causing harm can be blurry. Models often employ safety filters and refusal mechanisms to block harmful requests, but attackers constantly seek ways to bypass these defenses (as covered in Chapter 5). Overly strict harmlessness filters can also impede helpfulness by refusing benign requests that are misinterpreted as harmful (the "scunthorpe problem" or related issues).
It's important to recognize that these three dimensions are not entirely independent. There are often tensions and trade-offs:
Therefore, evaluating LLM safety and alignment isn't about maximizing each dimension in isolation but about achieving an acceptable balance that aligns with the specific application's goals and ethical considerations. Subsequent sections in this chapter will explore automated benchmarks, human evaluation, and red teaming techniques used to measure performance across these dimensions and navigate their inherent trade-offs. Understanding these definitions provides the foundation for applying those evaluation methodologies effectively.
© 2025 ApX Machine Learning