While the terms "interpretability" and "explainability" are often used interchangeably in discussions about understanding machine learning models, they carry distinct meanings. Recognizing this difference is helpful as we explore methods for peering inside the "black box."
Interpretability: Understanding the Mechanics
Interpretability refers to the extent to which a human can consistently predict the model's output or understand its internal mechanisms. Think of it as the inherent transparency of the model itself. How easily can you grasp the process by which the model transforms input features (X) into a prediction (y)?
Models vary significantly in their inherent interpretability:
- High Interpretability: Simple models like linear regression are highly interpretable. The relationship between each feature and the output is explicitly defined by a coefficient. You can directly say, "Increasing feature A by one unit changes the prediction by B units, holding others constant." Similarly, small decision trees are interpretable; you can follow the specific path of splits that leads to a prediction.
- Low Interpretability: Complex models like deep neural networks, gradient boosting machines (GBMs), or large random forests often have low interpretability. They involve intricate interactions between potentially thousands or millions of parameters. Understanding precisely how a specific input traverses the network layers or the ensemble of trees to arrive at a final prediction is incredibly difficult for a human observer.
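To make this contrast concrete, here is a minimal sketch using scikit-learn on synthetic data (both are illustrative assumptions, not tools prescribed by this course). For the linear model, each feature's effect is a single readable coefficient; for the random forest, the "parameters" amount to thousands of split rules spread across hundreds of trees.

```python
# A minimal sketch contrasting inherent interpretability.
# scikit-learn and synthetic data are illustrative choices, not requirements.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)

# High interpretability: every feature's effect is a single readable coefficient.
linear = LinearRegression().fit(X, y)
for i, coef in enumerate(linear.coef_):
    print(f"Feature {i}: +1 unit changes the prediction by {coef:.2f}")

# Low interpretability: the "parameters" are hundreds of trees with thousands
# of split rules; no single number tells you how a feature drives a prediction.
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
n_nodes = sum(est.tree_.node_count for est in forest.estimators_)
print(f"Random forest: {len(forest.estimators_)} trees, {n_nodes} nodes in total")
```

Reading the coefficients answers "how does the model work?" directly; counting the forest's nodes only quantifies why the same question is so hard to answer there.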
So, interpretability is fundamentally about whether the model's structure and parameters themselves are understandable.
Explainability: Justifying the Decisions
Explainability focuses on providing a human-understandable reason for a model's output, regardless of the model's internal complexity. It's about extracting insights or justifications after the model has made a prediction or learned a general behavior. Explainability often relies on supplementary techniques, especially for models that lack inherent interpretability.
The goal of explainability is to answer questions like:
- "Why did the model predict this specific outcome for this particular data point?" (Local Explanation)
- "Which features generally have the most impact on the model's predictions overall?" (Global Explanation)
- "How does the model's prediction change as I vary a specific feature's value?"
Techniques like LIME and SHAP, which we will cover in detail later, are primarily tools for achieving explainability. They aim to provide insights into the behavior of potentially complex, non-interpretable models.
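As a brief preview of the SHAP workflow covered later, the sketch below produces a local explanation for a single prediction and a global summary across the dataset. It assumes the `shap` package is installed; the model and data are again illustrative placeholders.

```python
# A minimal preview of SHAP-based explanations, assuming the `shap` package
# is installed. The model and data are illustrative placeholders.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one contribution per feature, per row

# Local explanation: why did the model predict what it did for row 0?
print("Average model output:", explainer.expected_value)
for i, contribution in enumerate(shap_values[0]):
    print(f"Feature {i} pushed this prediction by {contribution:+.3f}")

# Global explanation: aggregate the same contributions across all rows.
shap.summary_plot(shap_values, X)
```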
The Key Distinction
Here's a way to frame the difference:
- Interpretability: Can you understand how the model works inherently? (Focus on the model's mechanism)
- Explainability: Can you provide an understandable reason for why the model made a specific decision or behaves a certain way, even if its internal mechanics are complex? (Focus on the model's output and justification)
Imagine a highly skilled chef (the model).
- Interpretability is like understanding the chef's detailed recipe and cooking techniques for a specific dish. For a simple dish (simple model), this might be easy. For a complex one with secret ingredients and techniques (complex model), it might be nearly impossible.
- Explainability is like the chef telling you why certain ingredients (features) were essential for the final taste (prediction) of the dish you just ate, even if they don't reveal the entire complex recipe.
While an inherently interpretable model is typically also explainable, a model doesn't need to be fully interpretable to be explainable. We often apply explainability techniques precisely because the model lacks inherent interpretability. Our focus in this course is largely on these techniques that bring explainability to complex, powerful machine learning models.