In machine learning, one common type of problem involves teaching a computer to sort things into distinct groups or categories. This task is called classification. Think of it like a digital sorting hat: you provide some information (an input), and the model tells you which predefined category that input belongs to.
The goal of a classification model is to learn a mapping from input features (characteristics of the data) to specific output labels, often called classes. These labels represent discrete, distinct categories.
You encounter classification problems frequently, perhaps without realizing it:
- Email filtering: deciding whether an incoming message is spam or not spam (ham). These are the two possible categories or classes.
- Image recognition: identifying whether a photo contains a cat, a dog, or a bird.
- Medical diagnosis: predicting whether a patient has a disease or no disease.
- Sentiment analysis: judging whether a piece of text is positive, negative, or neutral.

In each case, the model's output is a specific category label chosen from a finite set of possibilities.
At a high level, a classification model learns patterns from data where the correct categories are already known (this is called labeled training data). For instance, to build a spam detector, we'd show the model many examples of emails, each already marked as spam or not spam. The model studies the features of these emails (like specific words, sender reputation, etc.) and learns rules or patterns that distinguish spam from legitimate messages.
Once trained, the model can take a new, unseen email, examine its features, and predict which category it belongs to.
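To make this concrete, here is a minimal sketch of the spam example using scikit-learn. The email texts and labels are invented purely for illustration, and the specific choice of a word-count vectorizer plus Naive Bayes is just one simple option; any classifier with a fit/predict interface follows the same train-then-predict pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labeled training data: each email text is paired with its known class.
# These examples are made up for illustration only.
emails = [
    "Win a free prize now, click here",
    "Limited offer, claim your reward today",
    "Meeting moved to 3pm, see agenda attached",
    "Can you review the report before Friday?",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline turns raw text into word-count features,
# then fits a simple Naive Bayes classifier on those features.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Once trained, the model predicts a class label for a new, unseen email.
print(model.predict(["Claim your free reward now"]))   # likely ['spam']
print(model.predict(["Agenda for Friday's meeting"]))  # likely ['ham']
```

The prediction is always one of the labels seen during training, which is exactly what makes this a classification task.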
Figure: A conceptual flow showing how input features are processed by a classification model to produce a predicted class label.
It's useful to contrast classification with regression (which we'll discuss next). While classification assigns data points to discrete categories (like spam/not spam, or cat/dog), regression models predict continuous numerical values (like the price of a house, the temperature tomorrow, or a student's test score). The type of output (category vs. number) is the fundamental difference.
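The contrast is easy to see in code. The sketch below (with a tiny invented dataset) trains a classifier and a regressor on the same one-feature input; the classifier returns a label from a fixed set, while the regressor returns a number. The decision-tree models are just convenient stand-ins here.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# One input feature: house size in square meters (values are invented).
X = [[50], [80], [120], [200]]

# Classification target: a discrete category per house.
size_class = ["small", "small", "large", "large"]

# Regression target: a continuous value (price in thousands).
price = [150.0, 210.0, 330.0, 520.0]

clf = DecisionTreeClassifier().fit(X, size_class)
reg = DecisionTreeRegressor().fit(X, price)

print(clf.predict([[100]]))  # a label drawn from {"small", "large"}
print(reg.predict([[100]]))  # a continuous numerical value
```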
Understanding classification is essential because evaluating these models requires specific metrics. We need to know more than just whether a prediction was right or wrong; we often need to understand the types of errors the model makes. For example, in spam detection, incorrectly classifying a legitimate email as spam (a false positive) might be more problematic than letting a spam email through (a false negative). Metrics designed for classification help us measure this performance accurately, which we will explore in detail in the next chapter.
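As a small preview of those metrics, the sketch below counts false positives and false negatives by hand for a handful of invented predictions, treating "spam" as the positive class:

```python
# True labels vs. the model's predictions (values invented for illustration).
y_true = ["spam", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham"]

# Treat "spam" as the positive class.
false_positives = sum(t == "ham" and p == "spam" for t, p in zip(y_true, y_pred))
false_negatives = sum(t == "spam" and p == "ham" for t, p in zip(y_true, y_pred))

print("False positives (legitimate email flagged as spam):", false_positives)  # 1
print("False negatives (spam that slipped through):", false_negatives)         # 1
```

The next chapter builds on exactly these counts to define metrics such as precision and recall.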