At its core, machine learning is about finding patterns in data. But for a computer to process this data, it must be converted into a structured, numerical format. This is where linear algebra provides the essential framework. It gives us the tools to organize information into clean, rectangular arrays of numbers that algorithms can work with.
Think about a simple dataset used for predicting house prices. For each house, we might record several attributes: its size in square feet, the number of bedrooms, and the number of bathrooms. In machine learning, these attributes are called features. A single house, with all its features, represents one observation or sample.
To a machine learning model, one specific house, say with 1,500 square feet, 3 bedrooms, and 2 bathrooms, is not a description. It's a point in space. We can represent this house as a feature vector, which is simply an ordered list of its numerical features.
$$\text{house}_1 = \begin{bmatrix} 1500 \\ 3 \\ 2 \end{bmatrix}$$

The order in the vector is important. We must decide on a consistent sequence, for example, [square footage, bedrooms, bathrooms]. Every single house in our dataset will be converted into a vector following this same structure. This allows us to compare different houses and perform mathematical operations on them consistently.
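As a minimal sketch in NumPy (the library this section uses later), the house above becomes a one-dimensional array:

```python
import numpy as np

# Feature vector for one house, in the agreed order:
# [square footage, bedrooms, bathrooms]
house_1 = np.array([1500, 3, 2])

print(house_1)        # [1500    3    2]
print(house_1.shape)  # (3,)
```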
A dataset rarely contains just one observation. Typically, we have hundreds, thousands, or even millions of them. Stacking all of our individual feature vectors together gives us a data matrix, which is almost always denoted by a capital letter, such as X.
By convention in machine learning, each row in the data matrix corresponds to a single observation (a house), and each column corresponds to a single feature (e.g., square footage).
If we have data for three houses:

- House 1: 1,500 sq ft, 3 bedrooms, 2 bathrooms
- House 2: 2,100 sq ft, 4 bedrooms, 3 bathrooms
- House 3: 1,200 sq ft, 2 bedrooms, 2 bathrooms
We can stack their feature vectors to form the data matrix X:
$$X = \begin{bmatrix} 1500 & 3 & 2 \\ 2100 & 4 & 3 \\ 1200 & 2 & 2 \end{bmatrix}$$

This matrix now contains our entire dataset of features. The dimensions of the matrix tell us about our dataset: if $X$ has shape $m \times n$, we have $m$ observations and $n$ features. In our example, $X$ is a $3 \times 3$ matrix.
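To make the stacking concrete, here is one way to build $X$ in NumPy with `np.vstack`, which places each feature vector as a row. Notice how the row/column convention falls out naturally when we slice:

```python
import numpy as np

# Feature vectors, one per house, in the order
# [square footage, bedrooms, bathrooms]
house_1 = np.array([1500, 3, 2])
house_2 = np.array([2100, 4, 3])
house_3 = np.array([1200, 2, 2])

# Stack the vectors as rows to form the data matrix X
X = np.vstack([house_1, house_2, house_3])

print(X.shape)   # (3, 3): m = 3 observations, n = 3 features
print(X[0])      # a row is one observation: [1500    3    2]
print(X[:, 0])   # a column is one feature:  [1500 2100 1200]
```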
In a broad class of machine learning tasks known as supervised learning, our goal is to predict a specific outcome. For our housing dataset, this would be the price of each house. This outcome is called the label or target.
Just as we did with the features, we collect the labels for each observation into a vector, commonly named $y$. If the prices for our three houses were \$300,000, \$450,000, and \$250,000, our target vector $y$ would be:
$$y = \begin{bmatrix} 300000 \\ 450000 \\ 250000 \end{bmatrix}$$

The first entry in $y$ corresponds to the first row in $X$, the second to the second, and so on. This alignment is critical for the model to learn the relationship between the features in $X$ and the outcomes in $y$; the code example later in this section checks it directly.
Diagram: raw data is split and organized into a feature matrix $X$ and a target vector $y$, which then serve as inputs to a machine learning model.
Structuring data this way might seem like a simple organizational step, but it's the foundation on which all modern machine learning is built: once data is in matrix form, the same operations (indexing, slicing, matrix products) apply uniformly to any dataset, and they run efficiently on hardware optimized for array computation.
In NumPy, creating these structures is straightforward. Our house data matrix and target vector would look like this:
```python
import numpy as np

# Data matrix X (3 observations, 3 features)
X = np.array([
    [1500, 3, 2],
    [2100, 4, 3],
    [1200, 2, 2]
])

# Target vector y (3 labels)
y = np.array([300000, 450000, 250000])

print("Data Matrix X:\n", X)
print("\nTarget Vector y:\n", y)
```
By organizing data into vectors and matrices, we translate a practical problem into a mathematical one. This allows us to use the full power of linear algebra to find patterns, make predictions, and build the intelligent systems that define machine learning.
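As a small illustration of that power (not a trained model), suppose a linear model had learned a hypothetical weight vector `w` assigning a dollar value to each feature. Predicting prices for every house then reduces to a single matrix-vector product:

```python
import numpy as np

X = np.array([
    [1500, 3, 2],
    [2100, 4, 3],
    [1200, 2, 2]
])

# Hypothetical weights: $150 per sq ft, $10,000 per bedroom,
# $5,000 per bathroom (made up for illustration only)
w = np.array([150, 10000, 5000])

# One matrix-vector product predicts prices for all houses at once
predictions = X @ w
print(predictions)  # [265000 370000 210000]
```

For a linear model, training amounts to finding a weight vector like this that makes the predictions match the targets in $y$ as closely as possible.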