Machine learning algorithms learn from experience, much like humans do. But what constitutes "experience" for an algorithm? The answer is data. If you think of a machine learning model as an engine, then data is its fuel. Without data, the engine cannot run; it has nothing to learn from.
In the context of machine learning, data typically refers to a collection of observations or examples. In many introductory problems, this data is organized in a structured format such as a table or spreadsheet, where each row represents a single observation (also called a sample, instance, or data point) and each column represents a specific characteristic, or feature, of that observation.
Imagine you want to build a system to predict whether an email is spam or not. Your data might look something like this:
A simple representation of email data. Each row is an email instance. The blue columns are features used for prediction, and the red column is the label we want the model to learn to predict.
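As a rough illustration of this layout, the sketch below stores a few emails as rows and then separates the feature columns from the label column. Every column name and value here (`num_links`, `contains_free`, `sender_domain`, `is_spam`) is hypothetical and invented for this example; it is not taken from a real dataset.

```python
# Hypothetical email data: all column names and values are invented
# for illustration only.
emails = [
    {"num_links": 7, "contains_free": True, "sender_domain": "promo.example", "is_spam": True},
    {"num_links": 0, "contains_free": False, "sender_domain": "work.example", "is_spam": False},
    {"num_links": 3, "contains_free": True, "sender_domain": "deals.example", "is_spam": True},
    {"num_links": 1, "contains_free": False, "sender_domain": "mail.example", "is_spam": False},
]

LABEL = "is_spam"  # the column we want a model to learn to predict

# Every column other than the label is treated as a feature.
feature_names = [name for name in emails[0] if name != LABEL]

# X holds one row of feature values per email; y holds the matching labels.
X = [[email[name] for name in feature_names] for email in emails]
y = [email[LABEL] for email in emails]

print(feature_names)
print(X[0], "->", y[0])
```

Keeping `X` and `y` as separate, row-aligned collections mirrors the convention most machine learning libraries expect: the i-th row of features corresponds to the i-th label.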
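Let's break down the terms using this example: each row is an instance (a single email), the columns used for prediction are the features, and the column the model learns to predict (whether the email is spam) is the label.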
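Both the quality and the quantity of data strongly influence the success of a machine learning project: a model trained on inaccurate or mislabeled examples learns the wrong patterns, and a model trained on too few examples has little to generalize from.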
Data can come in various types, such as numerical (e.g., temperature, height), categorical (e.g., color names, email sender domain), or text. Converting this raw data into a format that algorithms can work with is a large part of the machine learning process, which we will cover in Chapter 6, "Preparing Your Data".
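To give a feel for what such preparation can involve, here is a minimal sketch of one common technique, one-hot encoding, which turns a categorical column into numeric indicators. The `one_hot` helper and the `colors` column are hypothetical names written for this illustration, not part of any library, and this is not necessarily the approach taken in Chapter 6.

```python
def one_hot(values):
    """Map each categorical value to a list of 0/1 indicators.

    Hypothetical helper for illustration; categories are ordered alphabetically.
    """
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


colors = ["red", "green", "red", "blue"]  # hypothetical categorical column
categories, encoded = one_hot(colors)

print(categories)  # ['blue', 'green', 'red']
print(encoded[0])  # [0, 0, 1]
```

Each value becomes a row with a single 1 in the position of its category, so an algorithm that only understands numbers can still use the column.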
For now, the essential takeaway is that data forms the foundation upon which machine learning models are built. Understanding how data is structured, including the distinction between features and labels, is the first step towards understanding how these models learn.
© 2025 ApX Machine Learning