In the previous chapter, we explored regression, where the objective was to predict a continuous numerical value, like the price of a house or the temperature tomorrow. Now, we turn our attention to a different kind of supervised learning task: classification.
Imagine you're not trying to predict a specific number, but rather trying to place something into a group or category. That's the essence of classification. The goal is to learn a mapping from input variables (features) to predefined, discrete categories or classes.
Think about these common scenarios:
- Is this email spam or not spam?
- Does this image show a cat, a dog, or a car?
- Does this patient have the disease or no disease?
- Will this customer churn (stop using a service) or not churn?

In each case, the prediction isn't a number on a sliding scale; it's a distinct label chosen from a finite set of possibilities.
The defining characteristic of a classification problem is that the target variable, the thing we want to predict, is categorical. This means it takes on values that represent distinct groups or classes.
As with regression, we use input features to make predictions. For spam detection, features might include the frequency of certain words, the sender's address, or whether the email contains attachments. For medical diagnosis, features could be patient age, blood pressure, or results from specific lab tests.

The output we want to predict is called a class label (or simply a class or category). These labels are predefined. Examples include {'spam', 'not spam'}, {'cat', 'dog', 'car'}, and {'disease A', 'disease B', 'healthy'}.
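To make this concrete, here is a minimal sketch of how a few emails might be represented as feature vectors paired with class labels. The feature names and values are made up for illustration, not taken from a real dataset:

```python
# Hypothetical spam-detection data: each email becomes a feature vector
# plus a class label drawn from a predefined, finite set.
emails = [
    # features = [count of "free", count of "meeting", has attachment (0/1)]
    {"features": [5, 0, 0], "label": "spam"},
    {"features": [0, 3, 1], "label": "not spam"},
    {"features": [7, 1, 0], "label": "spam"},
]

# The target variable is categorical: only these labels are possible.
classes = {email["label"] for email in emails}
print(classes)  # a set containing 'spam' and 'not spam'
```

However the features are chosen, the key point is that the label on the right is always one of a fixed set of categories, never a free-ranging number.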
Classification problems can be broadly categorized into two main types:
- Binary Classification: This is the simplest form, where there are only two possible outcome classes. Many real-world questions fall into this category, often framed as yes/no decisions. Examples: spam filtering (spam/not spam), medical testing (positive/negative), churn prediction (churn/no churn).
- Multiclass Classification: Here, there are three or more possible outcome classes. The task is to assign an instance to one of these multiple categories. Examples: handwritten digit recognition (0, 1, 2, ..., 9), object recognition in images (person, car, tree, building), document categorization (sports, politics, technology, business).

Let's visualize a simple classification scenario. Imagine we have data points with two features (Feature 1 and Feature 2), and each point belongs to one of two classes (Class A or Class B).
Data points belonging to two different classes plotted based on two features. The goal of a classification algorithm is to learn how to distinguish between these classes based on their features.
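A dataset like the one in the figure can be sketched with synthetic data. The cluster centers and spread below are arbitrary choices for illustration; plotting (commented out) assumes matplotlib is available:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative clusters in a two-feature space (centers chosen arbitrarily).
class_a = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))

X = np.vstack([class_a, class_b])   # feature matrix, shape (100, 2)
y = np.array([0] * 50 + [1] * 50)   # labels: 0 = Class A, 1 = Class B

print(X.shape, y.shape)  # (100, 2) (100,)

# To reproduce a scatter plot like the figure:
# import matplotlib.pyplot as plt
# plt.scatter(X[:, 0], X[:, 1], c=y)
# plt.xlabel("Feature 1"); plt.ylabel("Feature 2"); plt.show()
```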
A classification algorithm's job is to learn a "rule" or "boundary" (often called a decision boundary, which we'll discuss later) that separates the different classes based on the patterns in the features. When a new, unseen data point arrives, the algorithm uses this learned rule to assign it to the most likely class.
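To make the idea of a learned rule concrete, here is a minimal sketch of one very simple rule: assign a new point to the class of its closest training example (a one-nearest-neighbor rule, previewing the KNN algorithm covered later). The training points are made up:

```python
import math

# Tiny made-up training set: (feature_1, feature_2) -> class
training_data = [
    ((1.0, 1.2), "Class A"),
    ((0.8, 0.9), "Class A"),
    ((3.1, 2.9), "Class B"),
    ((2.8, 3.2), "Class B"),
]

def predict(point):
    """Assign `point` to the class of its nearest training example."""
    nearest = min(training_data, key=lambda item: math.dist(point, item[0]))
    return nearest[1]

print(predict((1.1, 1.0)))  # Class A
print(predict((3.0, 3.0)))  # Class B
```

The "rule" here is implicit: the boundary between the classes is wherever a point switches from being closer to a Class A example to being closer to a Class B example. More sophisticated algorithms learn boundaries in other ways, but the job is the same.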
In this chapter, we will look at specific algorithms designed for these tasks, starting with Logistic Regression, which is commonly used for binary classification, and then K-Nearest Neighbors (KNN), a versatile algorithm applicable to both binary and multiclass problems. We'll also cover how to measure how well our classification models are performing.
© 2025 ApX Machine Learning