As we established in the previous section, data is the essential ingredient for machine learning. But what exactly constitutes this data? When we talk about a dataset in machine learning, we typically refer to a collection of examples, and each example is described by its characteristics or attributes. These characteristics are broken down into two fundamental components: features and labels.
Think of features as the individual, measurable properties or characteristics of the phenomenon you are observing. They are the inputs you feed into your machine learning model to make a prediction or find a pattern. Each feature represents a piece of information about a single data point or example.
Consider these scenarios:

- **Predicting house prices:** features might include the square footage, number of bedrooms, and age of the house.
- **Filtering spam email:** features might include the sender's domain name and the presence or absence of certain keywords.
- **Recognizing images:** features might be the individual pixel values of the image.
Features are often represented as columns in a spreadsheet or database table, where each row corresponds to a single example (like a specific house, email, or image). They can be numerical (like square footage or pixel values) or categorical (like the sender's domain name or the presence/absence of a keyword).
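To make the rows-and-columns idea concrete, here is a minimal sketch in plain Python. Each dictionary is one example (a row) and each key is a feature (a column); the feature names and values are illustrative, not from a real dataset:

```python
# Each dictionary is one example (a row); each key is a feature (a column).
houses = [
    {"size_sqft": 1500, "bedrooms": 3, "has_garage": "yes"},  # numerical + categorical
    {"size_sqft": 2100, "bedrooms": 4, "has_garage": "no"},
]

# Numerical features can be used directly in arithmetic...
avg_size = sum(h["size_sqft"] for h in houses) / len(houses)

# ...while categorical features are usually encoded as numbers first,
# e.g. a yes/no value becoming a 1/0 flag.
garage_flags = [1 if h["has_garage"] == "yes" else 0 for h in houses]

print(avg_size, garage_flags)
```

In practice this kind of table usually lives in a library structure such as a pandas DataFrame, but the row/column idea is the same.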
Other terms you might hear used interchangeably with features include:

- Attributes
- Variables (or independent variables)
- Inputs
- Predictors
- Dimensions
The label (or target) is the specific thing you are trying to predict with your machine learning model. It's the "answer" or the outcome you want the model to learn to associate with the input features.
Labels are primarily associated with Supervised Learning, a type of machine learning we'll discuss more later, where the dataset includes the correct output for each input example. The model learns by comparing its predictions to these known labels.
Let's revisit our examples:

- **Predicting house prices:** the label is the sale price of the house.
- **Filtering spam email:** the label is whether the email is "spam" or "not spam".
- **Recognizing images:** the label is the category the image belongs to, such as "cat" or "dog".
In datasets used for supervised learning, the label is typically represented as a dedicated column, distinct from the feature columns.
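Separating the label column from the feature columns is often one of the first data-preparation steps. A small sketch in plain Python (the column names mirror the house-price example and are illustrative; real code would typically use a library like pandas):

```python
# A supervised dataset: each row has feature columns plus a label column.
rows = [
    {"size_sqft": 1500, "bedrooms": 3, "age_years": 10, "price": 350000},
    {"size_sqft": 2100, "bedrooms": 4, "age_years": 5, "price": 480000},
]

label_column = "price"

# X holds the input features, y holds the corresponding labels.
X = [[v for k, v in row.items() if k != label_column] for row in rows]
y = [row[label_column] for row in rows]

print(X)  # feature values only
print(y)  # label values only
```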
Common synonyms for label include:

- Target (or target variable)
- Output
- Response variable
- Dependent variable
- Ground truth
The fundamental goal in supervised machine learning is to use the features to predict the label. The model learns the underlying relationship or pattern connecting the input features to the output label based on the examples provided in the training data.
Here's a simplified view of how features and labels might look in a small dataset for predicting house prices:
| Example ID | Size (sq ft) (Feature 1) | Bedrooms (Feature 2) | Age (years) (Feature 3) | Price ($) (Label) |
|---|---|---|---|---|
| 1 | 1500 | 3 | 10 | 350000 |
| 2 | 2100 | 4 | 5 | 480000 |
| 3 | 1200 | 2 | 25 | 290000 |
| 4 | 1800 | 3 | 8 | 410000 |
A simple tabular representation showing examples (rows) described by features (input columns) and their corresponding label (output column).
In this table:

- Each row is one example (one house).
- The Size, Bedrooms, and Age columns are the input features.
- The Price column is the label the model should learn to predict.
The machine learning model would be trained on this data (or much more data like it) to learn how the size, number of bedrooms, and age relate to the final price. Once trained, you could give the model the features for a new house (e.g., 1600 sq ft, 3 bedrooms, 12 years old), and it would predict the label (the likely price).
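As a rough sketch of this idea, we can fit a simple linear model to the four rows of the table above and use it to predict the price of the new house. The least-squares fit here stands in for a real training procedure; an actual workflow would use far more data and a library such as scikit-learn:

```python
import numpy as np

# Features (size, bedrooms, age) and labels (price) from the table above.
X = np.array([
    [1500, 3, 10],
    [2100, 4, 5],
    [1200, 2, 25],
    [1800, 3, 8],
], dtype=float)
y = np.array([350000, 480000, 290000, 410000], dtype=float)

# Add a column of ones so the linear model can learn an intercept term.
X_aug = np.hstack([np.ones((len(X), 1)), X])

# Fit price ≈ w0 + w1*size + w2*bedrooms + w3*age by least squares.
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

# Predict the label for a new house: 1600 sq ft, 3 bedrooms, 12 years old.
new_house = np.array([1.0, 1600, 3, 12])
predicted_price = new_house @ w
print(round(predicted_price))
```

With only four training examples the fitted weights are not meaningful, but the shape of the workflow (features in, label out) is exactly what larger models follow.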
It's important to note that not all machine learning tasks involve labels. In Unsupervised Learning, the goal is often to find structure or patterns within the data based only on the features, without any predefined correct answers. We will cover different types of learning in more detail later.
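To illustrate learning without labels, here is a toy clustering sketch: grouping houses by size alone, with no price column involved. It is a minimal one-dimensional k-means (k = 2) with illustrative data values, not a production clustering routine:

```python
# Unsupervised example: only features (house sizes), no labels.
sizes = [1500, 2100, 1200, 1800, 2900, 3100]

# Initialize the two cluster centers at the smallest and largest values.
centers = [min(sizes), max(sizes)]

for _ in range(10):  # a few refinement iterations
    clusters = [[], []]
    for s in sizes:
        # Assign each point to its nearest center.
        idx = 0 if abs(s - centers[0]) <= abs(s - centers[1]) else 1
        clusters[idx].append(s)
    # Move each center to the mean of its assigned points.
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # two "typical sizes" discovered from the features alone
```

The algorithm discovers two groups (smaller and larger houses) purely from the structure of the feature values, with no predefined correct answers.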
Understanding the distinction between features and labels is fundamental. It helps you frame your problem correctly, prepare your data appropriately, and select suitable machine learning algorithms. When someone provides a dataset, one of the first steps is often to identify which columns represent the input features and which column (if any) represents the target label you want to predict.
© 2025 ApX Machine Learning