Applying theoretical knowledge to real datasets is fundamental in machine learning. Preparing a dataset for an algorithm involves understanding how it is loaded and structured. This hands-on exercise covers a common initial step in a machine learning workflow: converting raw data into a NumPy matrix.
Let's imagine we have a small dataset for predicting house prices. The data includes the square footage, the number of bedrooms, and the final sale price. In a typical Python application, this data might start as a list of lists.
import numpy as np
# Each inner list represents a house: [Square Footage, Bedrooms, Price]
raw_data = [
    [1500, 3, 320000],
    [2100, 4, 450000],
    [1200, 2, 250000],
    [1800, 3, 380000]
]
While this format is readable, it is not optimized for the mathematical operations required by machine learning models. For that, we need to convert it into a NumPy array, which is the standard format for numerical data in Python.
Creating a NumPy matrix, or more accurately, a 2D ndarray, from a list of lists is straightforward using the np.array() function.
# Convert the list of lists into a 2D NumPy array
house_data_matrix = np.array(raw_data)
print(house_data_matrix)
This will produce the following output:
[[  1500      3 320000]
 [  2100      4 450000]
 [  1200      2 250000]
 [  1800      3 380000]]
Now our data is in a structured grid. Each row is a single observation (a house), and each column represents a specific attribute. This is the data matrix format that nearly every machine learning algorithm expects.
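Because the data is now a 2D array, you can already work with individual rows and columns through basic indexing. The snippet below is a small illustrative aside using the house_data_matrix defined above; the specific lookups shown are just examples.

# Look up the first observation (all attributes of the first house)
print(house_data_matrix[0])

# Look up a single attribute for every house (the bedrooms column)
print(house_data_matrix[:, 1])    # prints [3 4 2 3]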
In supervised learning, we distinguish between the features (the inputs used to make a prediction) and the target (the value we want to predict).
We can use NumPy's powerful slicing capabilities to separate our house_data_matrix into X and y.
The original data matrix is split into a feature matrix X and a target vector y.
Here is how to perform this split in code:
# Select all rows (:) and columns up to index 2 (exclusive) for features
X = house_data_matrix[:, :2]
# Select all rows (:) and only the last column (index 2) for the target
y = house_data_matrix[:, 2]
print("Feature Matrix X:")
print(X)
print("\nTarget Vector y:")
print(y)
The output confirms the split:
Feature Matrix X:
[[1500    3]
 [2100    4]
 [1200    2]
 [1800    3]]
Target Vector y:
[320000 450000 250000 380000]
A standard practice is to check the shape of your matrices and vectors. This helps confirm that your data is structured correctly and helps prevent errors in subsequent steps.
# Get the dimensions of the feature matrix and target vector
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
The output will be:
Shape of X: (4, 2)
Shape of y: (4,)
This tells us:
- X has 4 rows (observations) and 2 columns (features).
- y is a vector of length 4, corresponding to the 4 observations.

You have now successfully taken raw data and transformed it into the exact structure needed for machine learning. The feature matrix X and target vector y are the equivalents of the matrix A and vector b in the equation Ax = b we discussed for linear regression. All the matrix and vector operations you learned in previous chapters can now be applied directly to X and y to train a model, find patterns, or reduce dimensionality.
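To see how this connects back to Ax = b, here is a minimal sketch of fitting a linear model to X and y with ordinary least squares via np.linalg.lstsq. The added bias column and the names X_b and theta are illustrative choices for this sketch, not part of the exercise above.

# Add a column of ones so the model can also learn an intercept term
X_b = np.column_stack([np.ones(X.shape[0]), X])

# Solve the least squares problem: find theta minimizing ||X_b @ theta - y||
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("Fitted parameters (intercept, per-sqft weight, per-bedroom weight):")
print(theta)

# Use the fitted parameters to predict prices for the training houses
print("Predicted prices:", X_b @ theta)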