Okay, let's put everything we've learned together! In this section, we'll walk through a complete example of building a machine learning model. We'll use the popular Scikit-learn library in Python, which makes many of these steps straightforward. Remember the workflow we discussed? We'll follow it step-by-step: load data, prepare it, choose and train a model, make predictions, and evaluate the results.
For this example, we'll tackle a classification problem using a well-known dataset called the Iris dataset. This dataset contains measurements for different species of Iris flowers. Our goal is to build a model that can predict the species of an Iris flower based on its measurements.
First, make sure you have Scikit-learn installed. If you're using Anaconda, it's likely already included. If not, you can typically install it using pip:
pip install scikit-learn numpy pandas matplotlib seaborn
We'll also use NumPy for numerical operations, Pandas for data handling (though Scikit-learn can load Iris directly), and Matplotlib/Seaborn for visualization, such as the confusion matrix heatmap later on.
Scikit-learn comes with several built-in datasets, including Iris, which makes loading it very simple.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the Iris dataset
iris = load_iris()
X = iris.data # Features (sepal length, sepal width, petal length, petal width)
y = iris.target # Target variable (species encoded as 0, 1, 2)
# For clarity, let's see the feature names and target names
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
print("Data shape (samples, features):", X.shape)
print("Target shape (samples,):", y.shape)
print("\nFirst 5 samples:\n", X[:5])
print("First 5 targets:", y[:5])
The output shows we have 150 samples (flowers) and 4 features for each. The target y contains the numbers 0, 1, and 2, representing the species 'setosa', 'versicolor', and 'virginica'.
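If you prefer exploring the data with Pandas (which we imported above but haven't used yet), you can wrap the arrays in a DataFrame. This is a purely optional sketch for inspection; the column and species labels simply reuse the metadata returned by load_iris().
# Optional: wrap the arrays in a Pandas DataFrame for easier inspection
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(y, iris.target_names)
print(iris_df.head())                      # first few rows with readable column names
print(iris_df['species'].value_counts())   # 50 samples of each species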
Before training, we need to split our data into a training set (for the model to learn from) and a testing set (to evaluate how well it learned). A common split is 80% for training and 20% for testing.
We also need to scale our features. Algorithms like K-Nearest Neighbors (which we'll use) rely on the distance between data points. If features have vastly different ranges (e.g., one from 0-1 and another from 0-1000), the feature with the larger range can dominate the distance calculation. Scaling brings all features to a similar range. We'll use StandardScaler, which transforms each feature to have zero mean and unit variance.
Important: We fit the scaler only on the training data to prevent information from the test set leaking into the training process. Then, we use the same fitted scaler to transform both the training and testing data.
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
# Initialize the StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data ONLY
scaler.fit(X_train)
# Transform both the training and testing data using the fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Display the first few rows of scaled data to see the effect
print("\nFirst 5 scaled training samples:\n", X_train_scaled[:5])
Notice how the values in X_train_scaled are now centered around zero.
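If you want to verify what the scaler did, a quick sanity check (not part of the original walkthrough) is to print the per-feature mean and standard deviation of the scaled training data; they should be approximately 0 and 1.
# Sanity check: each scaled training feature should have roughly zero mean and unit variance
print("Feature means after scaling:   ", X_train_scaled.mean(axis=0).round(3))
print("Feature std devs after scaling:", X_train_scaled.std(axis=0).round(3))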
Now, we select an algorithm. Since this is a classification problem (predicting a category/species), and we covered K-Nearest Neighbors (KNN) earlier, let's use that. KNN classifies a new data point based on the majority class among its 'k' nearest neighbors in the feature space.
We need to choose a value for 'k' (the number of neighbors). A common starting point is k=3 or k=5. Let's use k=5.
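If you'd rather not guess, one common option is to compare a few candidate values of k with cross-validation on the training set. The sketch below is an optional illustration using Scikit-learn's cross_val_score; the range of k values tried is an arbitrary assumption, and we'll proceed with k=5 either way.
# Optional: compare a few values of k using 5-fold cross-validation on the training data
from sklearn.model_selection import cross_val_score

for k in [1, 3, 5, 7, 9]:
    cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X_train_scaled, y_train, cv=5)
    print(f"k={k}: mean cross-validation accuracy = {cv_scores.mean():.3f}")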
# Choose the model: K-Nearest Neighbors Classifier
# Instantiate the model with k=5 neighbors
knn_model = KNeighborsClassifier(n_neighbors=5)
# Train the model using the scaled training data
# The 'fit' method is where the learning happens
knn_model.fit(X_train_scaled, y_train)
print("\nModel training complete.")
That's it! The fit method has stored the training data (or learned patterns from it, depending on the model type). For KNN, it essentially memorizes the positions of the training data points in the scaled feature space.
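To see that "memorization" in action, you can ask the fitted model for the nearest training neighbors of any point. The snippet below is just an illustration using the first test sample; it isn't required for the rest of the workflow.
# Optional: inspect the 5 nearest training neighbors of the first test sample
distances, neighbor_idx = knn_model.kneighbors(X_test_scaled[:1])
print("Distances to the 5 nearest neighbors:", distances.round(3))
print("Species labels of those neighbors:   ", y_train[neighbor_idx])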
With our trained model, we can now predict the species for the flowers in our test set. Remember, the model hasn't seen this test data during training. We use the predict method.
# Use the trained model to make predictions on the scaled test data
y_pred = knn_model.predict(X_test_scaled)
# Display the predictions for the first 10 test samples
print("\nPredicted species for first 10 test samples:", y_pred[:10])
# Display the actual species for the first 10 test samples
print("Actual species for first 10 test samples: ", y_test[:10])
The model outputs an array y_pred containing the predicted species (0, 1, or 2) for each sample in the test set X_test_scaled. We can compare these predictions to the actual values y_test to see how well the model did.
Comparing predictions one by one is tedious. We need quantitative metrics. For classification, accuracy is a common starting point. It tells us the proportion of predictions that were correct.
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy on the Test Set: {accuracy:.4f}")
An accuracy of 1.0 would mean perfect prediction on the test set, while 0.0 would mean every prediction was wrong. Our KNN model performs very well here: on this particular split it classifies every test sample correctly, as the confusion matrix below confirms.
For a more detailed look, we can use a confusion matrix. It shows how many samples were correctly classified for each class and where misclassifications occurred. The rows typically represent the actual classes, and the columns represent the predicted classes.
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# For better visualization, create a DataFrame and use Seaborn heatmap
cm_df = pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names)
plt.figure(figsize=(7, 5))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues') # Using Blues colormap
plt.title('Confusion Matrix')
plt.ylabel('Actual Species')
plt.xlabel('Predicted Species')
plt.tight_layout() # Adjust layout to prevent clipping
plt.show() # Display the plot
# Print the confusion matrix values as well
print("\nConfusion Matrix:\n", cm_df)
The confusion matrix shows the counts of correct and incorrect predictions for each Iris species. In this case, all test samples were classified correctly.
The diagonal elements (from top-left to bottom-right) show the number of correct predictions for each class (setosa, versicolor, virginica). Off-diagonal elements show misclassifications. For example, if the cell at row 'versicolor' and column 'virginica' had a '1', it would mean one actual 'versicolor' flower was incorrectly predicted as 'virginica'. In our case, the perfect accuracy score is reflected in the confusion matrix having zeros everywhere except the main diagonal.
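As a quick check that matches this reading, you can pull the correct counts off the diagonal of cm and total the off-diagonal entries; with perfect accuracy the second number should be zero.
# The diagonal holds correct predictions; everything off the diagonal is a misclassification
print("Correct predictions per class:", np.diag(cm))
print("Total misclassifications:", cm.sum() - np.trace(cm))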
Congratulations! You've just walked through building, training, and evaluating your first machine learning model end-to-end. We performed these steps:
1. Loaded the data with load_iris() from Scikit-learn.
2. Prepared the data by splitting it into training and testing sets (train_test_split) and scaling the features (StandardScaler).
3. Chose a model: KNeighborsClassifier.
4. Trained the model with its .fit() method on the scaled training data.
5. Made predictions with its .predict() method on the scaled test data.
6. Evaluated the results with accuracy_score and visualized the confusion_matrix.
This workflow provides a solid foundation. While we used KNN here, the basic process (load, split, scale, fit, predict, evaluate) remains similar for many other supervised learning algorithms in Scikit-learn, just involving different model classes and potentially different evaluation metrics.
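To illustrate that point, here is a minimal sketch that swaps in a different classifier while reusing the same split and scaled features. A decision tree is assumed here purely as an example (trees don't actually need scaled inputs, but reusing them keeps the steps identical); only the model class changes.
# Example: the same load/split/scale/fit/predict/evaluate workflow with a different model
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(random_state=42)  # different model class, same workflow
tree_model.fit(X_train_scaled, y_train)               # train on the scaled training data
tree_pred = tree_model.predict(X_test_scaled)         # predict on the scaled test data
print("Decision tree accuracy:", accuracy_score(y_test, tree_pred))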