Applying theoretical knowledge to real datasets is fundamental in machine learning. Preparing a dataset for an algorithm involves understanding how it is loaded and structured. This hands-on exercise covers a common initial step in a machine learning workflow: converting raw data into a NumPy matrix.
Let's imagine we have a small dataset for predicting house prices. The data includes the square footage, the number of bedrooms, and the final sale price. In a typical Python application, this data might start as a list of lists.
import numpy as np
# Each inner list represents a house: [Square Footage, Bedrooms, Price]
raw_data = [
    [1500, 3, 320000],
    [2100, 4, 450000],
    [1200, 2, 250000],
    [1800, 3, 380000]
]
While this format is readable, it is not optimized for the mathematical operations required by machine learning models. For that, we need to convert it into a NumPy array, which is the standard format for numerical data in Python.
Creating a NumPy matrix, or more accurately, a 2D ndarray, from a list of lists is straightforward using the np.array() function.
# Convert the list of lists into a 2D NumPy array
house_data_matrix = np.array(raw_data)
print(house_data_matrix)
This will produce the following output:
[[  1500      3 320000]
 [  2100      4 450000]
 [  1200      2 250000]
 [  1800      3 380000]]
Now our data is in a structured grid. Each row is a single observation (a house), and each column represents a specific attribute. This is the data matrix format that nearly every machine learning algorithm expects.
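Because the data is now a 2D array, you can already work with individual rows and columns through basic indexing. The snippet below is a small illustrative aside using the house_data_matrix defined above; the specific lookups shown are just examples.

# Look up the first observation (all attributes of the first house)
print(house_data_matrix[0])

# Look up a single attribute for every house (the bedrooms column)
print(house_data_matrix[:, 1])    # prints [3 4 2 3]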
In supervised learning, we distinguish between the features (the inputs used to make a prediction) and the target (the value we want to predict).
We can use NumPy's powerful slicing capabilities to separate our house_data_matrix into X and y.
The original data matrix is split into a feature matrix X and a target vector y.
Here is how to perform this split in code:
# Select all rows (:) and columns up to index 2 (exclusive) for features
X = house_data_matrix[:, :2]
# Select all rows (:) and only the last column (index 2) for the target
y = house_data_matrix[:, 2]
print("Feature Matrix X:")
print(X)
print("\nTarget Vector y:")
print(y)
The output confirms the split:
Feature Matrix X:
[[1500    3]
 [2100    4]
 [1200    2]
 [1800    3]]
Target Vector y:
[320000 450000 250000 380000]
A standard practice is to check the shape of your matrices and vectors. This helps confirm that your data is structured correctly and helps prevent errors in subsequent steps.
# Get the dimensions of the feature matrix and target vector
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
The output will be:
Shape of X: (4, 2)
Shape of y: (4,)
This tells us:
- X has 4 rows (observations) and 2 columns (features).
- y is a vector of length 4, corresponding to the 4 observations.

You have now successfully taken raw data and transformed it into the exact structure needed for machine learning. The feature matrix X and target vector y are the equivalents of the matrix A and vector b in the equation Ax = b we discussed for linear regression. All the matrix and vector operations you learned in previous chapters can now be applied directly to X and y to train a model, find patterns, or reduce dimensionality.
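To see how this connects back to Ax = b, here is a minimal sketch of fitting a linear model to X and y with ordinary least squares via np.linalg.lstsq. The added bias column and the names X_b and theta are illustrative choices for this sketch, not part of the exercise above.

# Add a column of ones so the model can also learn an intercept term
X_b = np.column_stack([np.ones(X.shape[0]), X])

# Solve the least squares problem: find theta minimizing ||X_b @ theta - y||
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("Fitted parameters (intercept, per-sqft weight, per-bedroom weight):")
print(theta)

# Use the fitted parameters to predict prices for the training houses
print("Predicted prices:", X_b @ theta)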