Okay, let's put theory into practice. You've learned about unsupervised learning, clustering, and the mechanics of the K-Means algorithm. Now, we'll walk through applying K-Means to a simple dataset to see it in action. The goal is to group data points without knowing their "true" labels beforehand, relying only on the positions of the points themselves.
We'll use a common approach in introductory examples: generating synthetic data. This is helpful because we can create data where we know there are distinct groups, making it easier to visually check if K-Means does a reasonable job finding them. We will use Python along with popular libraries like Scikit-learn for the K-Means algorithm and Plotly for visualization.
First, ensure you have the necessary libraries. If you're working in an environment like Google Colab or Anaconda, these might already be installed. If not, you'd typically install them using pip:
pip install scikit-learn numpy plotly
For this example, we need numpy for numerical operations (especially creating our data), sklearn.cluster for the KMeans algorithm, and plotly.graph_objects for creating interactive plots suitable for the web.
Let's create some 2-dimensional data that clearly falls into three groups or "blobs". Scikit-learn provides a handy function, make_blobs, for exactly this purpose.
import numpy as np
import plotly.graph_objects as go
from sklearn.datasets import make_blobs
# Generate synthetic data with 3 distinct clusters
X, _ = make_blobs(n_samples=150,    # Total number of points
                  centers=3,        # Number of clusters to generate
                  cluster_std=0.8,  # Standard deviation of the clusters (spread)
                  random_state=42)  # For reproducibility

# We ignore the second output (y), which are the true labels
# X is now a NumPy array with 150 rows and 2 columns (our features)

# Let's visualize the raw data before clustering
fig_raw = go.Figure(data=[go.Scatter(
    x=X[:, 0],
    y=X[:, 1],
    mode='markers',
    marker=dict(color='#495057', size=7, opacity=0.8)  # Use gray for raw data
)])

fig_raw.update_layout(
    title='Synthetic Data Points (Before Clustering)',
    xaxis_title='Feature 1',
    yaxis_title='Feature 2',
    width=600,
    height=450,
    plot_bgcolor='#f8f9fa'  # Light background
)
# Display the plot (In a notebook/web environment)
# fig_raw.show()
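If you are running this as a plain Python script rather than in a notebook, one option is to write the interactive figure to a standalone HTML file instead (the filename below is just an example):

# Save the interactive figure as an HTML file you can open in a browser
fig_raw.write_html('raw_data_plot.html')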
Before applying K-Means, it's always a good idea to look at your data. Here’s the plot generated by the code above:
{"layout": {"title": "Synthetic Data Points (Before Clustering)", "xaxis_title": "Feature 1", "yaxis_title": "Feature 2", "width": 600, "height": 450, "plot_bgcolor": "#f8f9fa"}, "data": [{"type": "scatter", "x": [9.40487926, -3.8944466, -2.17399939, 0.63447058, 3.0902331, 8.67028461, -4.0307817, -1.82419002, 8.68490888, 8.56820617, 2.32643218, -3.49879463, -3.60856438, 2.29491571, 9.26813307, -1.21909073, -4.28364986, 0.0112588, -1.36399585, 2.33003117, 8.58921288, 10.02922413, -2.82413277, -3.24349636, 9.71135036, 8.06968037, 8.27200317, -2.27073282, 2.51370938, -3.86908011, 7.91291223, 9.39663251, -3.89519813, -3.61645308, 2.93376674, 8.18409341, -2.56506782, 1.85706746, 2.20446538, 8.42895401, 0.94870871, -2.22214035, 9.28596881, -3.09745833, 2.9651487, -3.09142537, 2.62396722, -4.11892455, -3.47499921, 9.78631698, 8.35558648, -3.41549603, 9.28499641, 8.76873352, 2.68645927, 9.50358321, -3.61725122, 2.21014116, -3.38299587, 1.62583857, 8.64980199, 1.23435846, -4.25895783, -3.05677876, -2.99166233, -2.7734411, 10.20133189, 2.41259211, -2.34841319, 8.89124388, -2.57907391, -3.53489394, 10.34603865, -3.18922301, 2.89968801, -3.05746663, 8.94312171, -2.34927089, 1.68502645, 2.35518894, 8.56998805, -1.91377378, 1.3525249, -2.97908147, 2.72385102, 8.08909871, 1.45896167, 8.6176526, 1.98292497, -4.00345846, 2.60171305, 9.8484214, 2.06567807, -4.43579128, 9.30611344, 8.98809746, -3.88682853, 8.39494944, -1.27533232, -1.8886289, 10.13180203, -4.23528217, 8.82484509, -3.01684953, -4.08703417, -3.11201568, -3.09844608, 2.48017873, 9.69327317, -3.94129376, 2.62140818, -3.2776345, -3.68351318, 2.91030311, -2.1229853, 1.68181629, 9.04357012, 7.98905271, -1.68185409, 8.4383966, 2.45147835, -3.20806829, -3.19362707, -2.57485385, 1.97969605, 9.07676821, 9.25474465, -1.39757857, 9.32199849, 2.47611934, 9.25977058, -2.67867522, -2.58810129, -3.59801299, -3.09005407, 8.20591249, 2.85306097, -2.80153218, 1.55067484, 8.69809633, 10.2130389, 8.73647584, 1.56174678, 9.72528665, -3.25816333, 2.9917874, 8.6998064, 8.99835309, 1.32281877], "y": [0.32379694, -2.38218416, 8.16659979, 7.79143032, 7.05315316, -0.54950377, -1.10321427, 8.96916688, 0.71479233, -0.08938143, 8.60753479, -1.61918697, -2.40769042, 6.9011889, -0.34905046, 8.14757385, -2.2290819, 7.39361353, 7.77945849, 7.33166537, 0.38512236, 0.18751794, 7.25888311, -2.1941847, -0.46837488, 0.88659018, 0.62688641, 9.12352141, 7.09137364, -1.59177754, 0.7301589, -0.40308784, -1.60287089, -1.60081525, 7.36820713, 0.92289687, 8.5053366, 6.57942344, 8.1834023, -0.1006295, 7.23553302, 7.32431041, 0.20755535, -1.56254636, 8.33589429, -1.63656999, 7.95017784, -1.46774876, -1.65414337, 0.62850943, 0.13807953, -2.44838893, 0.7287545, 0.86400827, 7.99216678, 0.02840174, -1.45309417, 7.58311015, -2.42025743, 7.73518309, 1.35760308, 7.20993471, -2.11475232, -2.07976469, -2.68137258, 8.14907471, -0.66630451, 7.58580846, 9.09953083, 0.8325716, 7.87177664, -0.8474635, -0.2716906, -1.10928646, 6.61767197, -2.00044627, 0.14767768, 8.60353306, 6.82982971, 8.43919954, -0.31182744, 9.18548535, 7.73797418, -2.53353386, 7.14496464, -0.30212011, 8.35835846, 0.70151865, 7.07295833, -1.6835102, 8.27544208, 0.10867936, 6.61151908, -1.34812434, -0.36474129, 0.21541716, -1.60279546, 0.23526112, 7.48901314, 8.62308333, 0.07945874, -1.44304745, -0.3848356, -2.57820237, -1.17392271, -1.49738819, -2.62645786, 7.67681382, 0.34577225, -2.30683014, 8.38021859, -2.61503097, -1.04255853, 7.7720913, 8.52102629, 7.69048475, 1.06134237, 1.45636823, 7.98726332, 0.61506872, 8.06523303, 
-0.89907947, -1.37764128, 7.74734064, 7.01837241, 0.38861725, -0.45738783, 7.89803359, 1.38896482, 8.70073113, -0.27467057, 8.34059978, 7.76004224, -1.71469683, -1.28558574, 0.75772998, 7.76503948, 8.61409983, 8.05351342, 0.08395616, -0.24050672, 0.23092453, 7.22386065, 0.82060666, -1.90397304, 7.76976782, 0.15193608, -0.28945617, 7.17659323]}], "mode": "markers", "marker": {"color": "#495057", "size": 7, "opacity": 0.8}}]}
The synthetic data points plotted in 2D space. We can visually identify three potential groups.
As you can see, the points form three reasonably well-separated groups. Our eyes can perform this clustering task quite easily for this simple 2D data. Let's see if K-Means can replicate this.
Now we'll use Scikit-learn's KMeans implementation. We need to tell the algorithm how many clusters (K) to look for. Since we generated the data with 3 centers, let's set K=3.
from sklearn.cluster import KMeans
# Initialize the K-Means algorithm
# n_clusters is the most important parameter: the number of clusters (K)
# n_init='auto' uses an intelligent default for running the algorithm multiple times
# with different centroid seeds to improve results.
# random_state ensures reproducibility of the initialization.
kmeans = KMeans(n_clusters=3, n_init='auto', random_state=42)
# Fit the algorithm to the data X
# This is where K-Means iterates: assigning points to clusters and updating centroids.
kmeans.fit(X)
# After fitting, the model contains the results:
# 1. Cluster assignments for each data point:
cluster_labels = kmeans.labels_
# 2. Coordinates of the final cluster centers (centroids):
centroids = kmeans.cluster_centers_
# print("Cluster labels assigned to each point:", cluster_labels)
# print("Coordinates of final centroids:\n", centroids)
The fit() method runs the K-Means algorithm on our data X. The algorithm iteratively assigns each point to the nearest centroid and then recalculates the centroid positions based on the assigned points, until the centroids stabilize or a maximum number of iterations is reached.
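To make that iterative process more concrete, here is a minimal NumPy sketch of the assign-and-update loop. It is illustrative only: it uses a naive random initialization and ignores details such as multiple restarts and empty-cluster handling that Scikit-learn's implementation manages for you.

import numpy as np  # already imported above

rng = np.random.default_rng(42)
centers = X[rng.choice(len(X), size=3, replace=False)]  # pick 3 data points as starting centroids

for _ in range(10):  # a fixed iteration cap keeps the sketch short
    # Assignment step: compute each point's distance to every centroid and pick the nearest
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of the points currently assigned to it
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centers, centers):  # stop once the centroids no longer move
        break
    centers = new_centers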
The results are stored in the kmeans object:
kmeans.labels_: An array where the i-th element indicates the cluster index (0, 1, or 2 in this case) assigned to the i-th data point in X.
kmeans.cluster_centers_: A 2D array containing the final coordinates of the centroids for each cluster.
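For example, a quick way to check how many points were assigned to each cluster is to count the labels (a small sketch using NumPy, which is already imported):

# Count how many points received each cluster index (0, 1, 2)
points_per_cluster = np.bincount(cluster_labels)
print(points_per_cluster)  # roughly 50 points per cluster for this synthetic data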
Now, let's visualize the same data points, but this time color them according to the cluster labels assigned by K-Means. We'll also plot the final centroids found by the algorithm.
# Define colors for the clusters
cluster_colors = ['#4263eb', '#12b886', '#fd7e14'] # Indigo, Teal, Orange
centroid_color = '#f03e3e' # Red for centroids
# Create the plot
fig_clustered = go.Figure()
# Add data points, colored by cluster label
for i in range(3):  # Loop through clusters 0, 1, 2
    points_in_cluster = X[cluster_labels == i]
    fig_clustered.add_trace(go.Scatter(
        x=points_in_cluster[:, 0],
        y=points_in_cluster[:, 1],
        mode='markers',
        marker=dict(color=cluster_colors[i], size=7, opacity=0.8),
        name=f'Cluster {i}'
    ))

# Add the centroids
fig_clustered.add_trace(go.Scatter(
    x=centroids[:, 0],
    y=centroids[:, 1],
    mode='markers',
    marker=dict(color=centroid_color, size=14, symbol='x', line=dict(width=3)),
    name='Centroids'
))

fig_clustered.update_layout(
    title='K-Means Clustering Results (K=3)',
    xaxis_title='Feature 1',
    yaxis_title='Feature 2',
    width=600,
    height=450,
    plot_bgcolor='#f8f9fa',
    legend_title_text='Legend'
)
# Display the plot
# fig_clustered.show()
Here's the resulting plot:
The same data points, now colored according to the cluster assigned by K-Means with K=3. The red 'x' markers indicate the final positions of the cluster centroids.
Compare the K-Means result plot with the initial plot of the raw data. You should see that K-Means has successfully identified the three distinct groups present in our synthetic data. Each color represents a cluster found by the algorithm, and the red 'x' marks show the center (mean position) of all points belonging to that cluster.
In this simple case, where the clusters are well-separated and roughly spherical, K-Means performs very well.
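Because we generated the data ourselves, we can also quantify the agreement. The sketch below assumes we call make_blobs again with the same arguments, keeping the true labels this time, and compares them with the K-Means assignments using the adjusted Rand index from Scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Re-generate the identical data, keeping the true labels this time
_, y_true = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

# A score of 1.0 means the K-Means grouping matches the generating labels exactly
# (regardless of which numeric label each cluster received)
print(adjusted_rand_score(y_true, cluster_labels))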
Remember the discussion about choosing K? Let's briefly consider what might happen if we instructed K-Means to find, say, K=2 clusters in this data. The algorithm would still run, but it would be forced to partition the three visible groups into only two clusters. Typically, it might merge two of the original groups or split one group across the two resulting clusters, depending on the initial centroid placement. Similarly, choosing K=4 would force the algorithm to split one or more of the natural groups into smaller, potentially less meaningful clusters. This exercise highlights that while K-Means is effective at partitioning data, the choice of K significantly influences the outcome and its interpretation.
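One way to see the effect of K numerically is to fit the model for several values of K and compare each fit's inertia_ (the within-cluster sum of squared distances used in the elbow-style comparison for choosing K). A short sketch:

# Fit K-Means for a range of K values and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init='auto', random_state=42).fit(X)
    inertias[k] = model.inertia_

print(inertias)
# Inertia always decreases as K grows; the decrease typically flattens
# noticeably after the natural number of clusters (around K=3 here).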
In this practice section, you've applied the K-Means algorithm to a simple, visually intuitive dataset. You saw how to:
Generate synthetic data with distinct groups using make_blobs.
Initialize and fit a KMeans model from Scikit-learn, specifying the desired number of clusters (K).
Retrieve the cluster assignments (labels_) and centroid locations (cluster_centers_) from the fitted model.
This hands-on example demonstrates the core process of using K-Means for finding groups in unlabeled data. While real-world data is often more complex and higher-dimensional, the fundamental steps remain similar. You now have a practical foundation for understanding how K-Means works and how to implement it using common tools.