In the previous sections, we discussed model generalization and the common problems of underfitting and overfitting. Underfitting occurs when a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and unseen data. Overfitting happens when a model learns the training data too well, including its noise and idiosyncrasies, resulting in excellent performance on the training set but poor performance on new data. The bias-variance tradeoff provides a valuable framework for understanding the relationship between model complexity and these generalization errors.
Imagine you are trying to hit a target with darts.
- Bias is like a systematic error in your aim. If you consistently throw darts to the upper left of the bullseye, you have high bias. Your throws are systematically off target, regardless of how tightly grouped they are. In machine learning, high bias means the model makes strong, potentially incorrect, assumptions about the data. It fails to capture the true relationship between inputs and outputs, leading to underfitting. Simple models, like linear regression applied to highly non-linear data, often exhibit high bias.
- Variance is like the scatter of your throws. If your darts land all over the board, even if centered around the bullseye on average, you have high variance. Your throws are very sensitive to small changes in your stance or release. In machine learning, high variance means the model is overly sensitive to the specific training data it was exposed to. It learns fluctuations, including noise, which don't generalize. Complex models, like deep neural networks with many parameters, are prone to high variance, leading to overfitting.
The Tradeoff Explained
There's an inherent tension between bias and variance.
- Increasing model complexity generally decreases bias. A more flexible model can better fit intricate patterns in the training data, reducing systematic errors.
- However, increasing model complexity generally increases variance. A more flexible model can also more easily fit the noise in the specific training set, making it less stable and causing it to perform poorly on different data.
Conversely, simplifying a model tends to increase bias but decrease variance. This relationship is the "tradeoff": it's challenging to minimize both sources of error simultaneously using only model complexity as the lever.
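To make this concrete, here is a minimal sketch (NumPy assumed; the synthetic data and polynomial degrees are illustrative choices, not prescribed values) that fits the same noisy, non-linear data with a very simple model and a very flexible one. The simple fit misses the pattern on both sets (high bias), while the flexible fit chases training noise and degrades on held-out data (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 40)
x_test = rng.uniform(-3, 3, 40)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=40)   # noisy non-linear target
y_test = np.sin(x_test) + rng.normal(scale=0.3, size=40)

for degree in (1, 15):   # degree 1: too rigid (high bias); degree 15: too flexible (high variance)
    coeffs = np.polyfit(x_train, y_train, degree)            # fit a polynomial of the given degree
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree={degree:2d}  train MSE={mse(x_train, y_train):.3f}  "
          f"test MSE={mse(x_test, y_test):.3f}")
```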
Decomposing Prediction Error
We can think of the expected prediction error of a model for a given input point x as decomposing into three parts:
$$\text{Expected Error}(x) = \big(\text{Bias}[\hat{f}(x)]\big)^2 + \text{Variance}[\hat{f}(x)] + \text{Irreducible Error}$$
Let's break this down:
- Bias Squared ($(\text{Bias}[\hat{f}(x)])^2$): Measures how far, on average, the model's predictions are from the true underlying function $f(x)$. It reflects the error introduced by the model's simplifying assumptions.
- Variance ($\text{Variance}[\hat{f}(x)]$): Measures the variability of the model's predictions for a given point x if we were to retrain the model multiple times on different training sets drawn from the same distribution. It reflects the model's sensitivity to the specific training data.
- Irreducible Error ($\sigma^2$): Represents the noise inherent in the data itself. This error cannot be reduced by any model, no matter how good. It's the lower bound on the expected error.
Our goal in training is to find a model complexity that balances bias and variance to minimize the total expected error (primarily the sum of bias squared and variance, as we can't control the irreducible error).
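As a rough illustration of the decomposition, the sketch below (NumPy assumed; the true function, noise level, and model class are made up for the example) retrains the same model on many independent training sets and estimates the squared bias and the variance of its prediction at a single query point.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = np.sin          # assumed "true" underlying function
x0, sigma = 1.5, 0.3     # query point x and noise standard deviation

preds = []
for _ in range(1000):    # retrain on 1000 independent training sets
    x = rng.uniform(-3, 3, 40)
    y = true_f(x) + rng.normal(scale=sigma, size=40)   # noisy observations
    coeffs = np.polyfit(x, y, 3)                       # same model class each time
    preds.append(np.polyval(coeffs, x0))               # prediction at x0

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared gap between average prediction and truth
variance = preds.var()                       # spread of predictions across training sets
print(f"bias^2 ~ {bias_sq:.4f}   variance ~ {variance:.4f}   irreducible ~ {sigma**2:.4f}")
```

Summing the two printed estimates with the irreducible $\sigma^2$ gives an approximation of the expected squared error at the query point.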
Visualizing the Tradeoff
The relationship between model complexity, bias, variance, and overall error is often visualized by plotting each quantity against complexity. Such a plot shows that increasing model complexity typically reduces bias but increases variance. The total expected error (often approximated by validation error) first decreases as bias drops, then rises as variance starts to dominate. The optimal complexity is the point that balances these two components.
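In place of a figure, the following sketch (NumPy and Matplotlib assumed, with illustrative synthetic data) traces the curves just described by sweeping polynomial degree as the complexity knob and plotting training versus validation error.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x_tr, x_val = rng.uniform(-3, 3, 60), rng.uniform(-3, 3, 200)
y_tr = np.sin(x_tr) + rng.normal(scale=0.3, size=60)
y_val = np.sin(x_val) + rng.normal(scale=0.3, size=200)

mse = lambda c, x, y: np.mean((np.polyval(c, x) - y) ** 2)
degrees = range(1, 16)
train_err, val_err = [], []
for d in degrees:                      # polynomial degree as the complexity knob
    c = np.polyfit(x_tr, y_tr, d)
    train_err.append(mse(c, x_tr, y_tr))
    val_err.append(mse(c, x_val, y_val))

plt.plot(degrees, train_err, label="training error")
plt.plot(degrees, val_err, label="validation error")
plt.xlabel("model complexity (polynomial degree)")
plt.ylabel("MSE")
plt.legend()
plt.show()
```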
Bias and Variance in Deep Learning
Deep neural networks are typically highly flexible models with millions, sometimes billions, of parameters. This means they generally have the capacity to achieve very low bias: they can approximate extremely complex functions. Consequently, when working with deep learning models, the primary challenge often shifts towards controlling variance and preventing overfitting.
While the classical view suggests a clear U-shaped curve for test error as complexity increases, the behavior of deep learning models can sometimes be more intricate. However, the fundamental principles remain informative:
- Underfitting (High Bias): If your deep network performs poorly even on the training data, it may lack capacity (e.g., too few layers or neurons) or may not have been trained long enough. This suggests a bias problem.
- Overfitting (High Variance): If your network achieves very low training error but high validation/test error, it's likely overfitting. The model is too sensitive to the training data specifics. This indicates a variance problem.
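These two checks can be reduced to a simple rule of thumb. The toy function below is one hedged way to encode it, assuming you already track training and validation losses; the thresholds are arbitrary placeholders, not recommended values.

```python
# Toy diagnostic heuristic: train_loss / val_loss are final-epoch losses,
# and good_enough / gap_tol are illustrative placeholders.
def diagnose(train_loss, val_loss, good_enough=0.10, gap_tol=0.05):
    if train_loss > good_enough:
        return "high bias: add capacity, train longer, or reduce regularization"
    if val_loss - train_loss > gap_tol:
        return "high variance: regularize, add data, or simplify the model"
    return "bias and variance look reasonably balanced"

print(diagnose(train_loss=0.08, val_loss=0.35))   # -> high-variance message
```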
The techniques discussed throughout this course, such as regularization (L1/L2, Dropout, Batch Normalization) and optimization strategies, are largely designed to help manage this tradeoff, primarily by controlling the model's effective complexity and reducing variance without significantly increasing bias. Understanding the bias-variance tradeoff helps us diagnose model performance issues and select appropriate methods to improve generalization.
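As a closing sketch, here is one common way to apply two of those levers in PyTorch (assuming PyTorch as the framework; the layer sizes and hyperparameters are placeholders, not values from this course): Dropout layers inside the network and an L2 penalty via the optimizer's weight_decay argument, both of which reduce variance at the cost of a small increase in bias.

```python
import torch
import torch.nn as nn

HIDDEN = 256   # placeholder width

# Dropout randomly zeroes activations during training, reducing the network's
# sensitivity to any particular training example (lower variance).
model = nn.Sequential(
    nn.Linear(784, HIDDEN), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(HIDDEN, 10),
)

# weight_decay adds an L2 penalty on the weights, shrinking the model's
# effective complexity; the value here is a placeholder to tune.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```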