While accuracy gives us a quick summary of overall correctness (the proportion of total predictions that were right), it can sometimes paint an overly optimistic or even misleading picture of a model's performance. This often happens when dealing with imbalanced datasets.
An imbalanced dataset is one where the number of observations belonging to one class is significantly lower than the number belonging to the other classes. Think about tasks like detecting fraudulent transactions among millions of legitimate ones, or diagnosing a rare disease in a population of mostly healthy patients.
In these situations, one class (the "majority" class, like healthy patients or legitimate transactions) vastly outnumbers the other (the "minority" class, like sick patients or fraudulent transactions).
Let's consider our fraud detection example again. Imagine we have data on 1,000 transactions: 990 are legitimate and only 10 are fraudulent.
The vast majority of transactions are legitimate, while fraudulent ones are rare.
Now, suppose we build a very simple (perhaps naive) classification model. This model isn't very sophisticated; it simply predicts that every transaction is legitimate. It never flags anything as fraudulent.
How accurate is this model? Let's calculate:
By predicting "legitimate" for all 1,000 transactions, the model gets the 990 legitimate ones right and the 10 fraudulent ones wrong. That gives 990 correct predictions out of 1,000 total predictions.
So, the accuracy is:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{990}{1000} = 0.99$$

An accuracy of 99%! That sounds fantastic, right?
Here's the catch: while the model achieves 99% accuracy, it completely fails at the task it was intended for, which is detecting fraud. It correctly identifies all the legitimate transactions but misses every single fraudulent one. A model that catches zero fraud is practically useless, even if its overall accuracy score is high.
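To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available) that scores the naive always-"legitimate" model on the 1,000-transaction example:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Labels mirroring the example: 0 = legitimate (990), 1 = fraudulent (10)
y_true = np.array([0] * 990 + [1] * 10)

# The naive model predicts "legitimate" for every transaction
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))   # 0.99
print((y_pred == 1).sum())              # 0 -- not a single transaction flagged as fraud
```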
This happens because the sheer number of majority class examples dominates the accuracy calculation. The model gets high marks just by correctly identifying the most common outcome. The errors made on the small number of minority class examples barely affect the overall percentage.
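A quick back-of-the-envelope loop illustrates this: as long as the model always predicts the majority class, its accuracy is simply the majority-class share, no matter how many minority cases it misses.

```python
# Accuracy of the always-"legitimate" model is just the majority-class proportion
for n_fraud in (10, 50, 100):
    n_legit = 1000 - n_fraud
    print(f"{n_fraud:>3} fraud cases missed, accuracy = {n_legit / 1000:.2f}")
```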
In many real-world applications, especially those involving imbalanced data, correctly identifying the minority class is often the most important goal. Missing a fraudulent transaction can be much more costly than misclassifying a legitimate one. Similarly, failing to detect a rare disease can have severe consequences.
Therefore, relying solely on accuracy in such scenarios can lead to deploying models that perform poorly on the tasks we care about most. This highlights the need for other evaluation metrics that give us more insight into how the model performs on each class, particularly the minority class. Metrics derived from the confusion matrix, like precision and recall, which we'll discuss next, help provide this deeper understanding.
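As a preview, here is a small sketch (again assuming scikit-learn) of how the confusion matrix and recall expose the failure that accuracy hides for this example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array([0] * 990 + [1] * 10)   # 0 = legitimate, 1 = fraudulent
y_pred = np.zeros_like(y_true)            # naive model: everything "legitimate"

print(confusion_matrix(y_true, y_pred))
# [[990   0]
#  [ 10   0]]   <- second row: all 10 fraudulent transactions were missed

print(recall_score(y_true, y_pred))       # 0.0 -- no fraud detected at all
```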