Building upon the general concept of unsupervised learning introduced at the start of this chapter, we now turn our attention to a specific and widely used technique for partitioning data: K-Means clustering. The objective of K-Means is straightforward: to group data points into a predefined number (k) of distinct, non-overlapping clusters based on their similarity. Data points within the same cluster should be similar to each other, while points in different clusters should be dissimilar.
K-Means is an iterative algorithm that aims to minimize the distance between data points and the center of the cluster they belong to. The "center" of a cluster is known as its centroid, which is typically calculated as the mean of all the data points within that cluster.
The algorithm operates through these primary steps:
1. Initialization: Choose k initial centroids, either at random or with a smarter scheme such as k-means++.
2. Assignment: Assign each data point to the cluster whose centroid is nearest, typically using Euclidean distance.
3. Update: Recompute each centroid as the mean of the data points currently assigned to its cluster.
4. Repeat: Alternate the assignment and update steps until the assignments stop changing or a maximum number of iterations is reached.
The algorithm's goal is to minimize the Within-Cluster Sum of Squares (WCSS), often referred to as inertia. This is the sum of squared distances between each data point and the centroid of its assigned cluster:
$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $k$ is the number of clusters, $C_i$ is the set of points in cluster $i$, $\mu_i$ is the centroid of cluster $i$, and $\lVert x - \mu_i \rVert^2$ is the squared Euclidean distance to that centroid.
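To make the objective and the two alternating steps concrete, here is a minimal NumPy sketch of the assignment and update loop. This is an illustration only, not the scikit-learn implementation: the function name, the plain random initialization, and the fixed iteration count are assumptions for this example, and it does not handle the corner case of a cluster becoming empty.
import numpy as np

def kmeans_sketch(X, k, n_iters=10, seed=0):
    # Pick k distinct data points as the initial centroids (plain random init, not k-means++)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        # (assumes no cluster ends up empty)
        centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    # WCSS / inertia: sum of squared distances from each point to its assigned centroid
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss

# Example usage: labels, centroids, wcss = kmeans_sketch(X, k=3)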
Scikit-learn provides a convenient implementation of K-Means via the sklearn.cluster.KMeans class. Let's walk through a practical example.
First, we need some data. We'll use scikit-learn's make_blobs function to generate synthetic data with distinct clusters, which is useful for illustrating how K-Means works.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Generate synthetic data with 3 centers
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.80, random_state=42)
# Scale the data (important for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Instantiate KMeans
# We specify n_clusters (k) = 3, since we know the true structure for this example
# n_init=10 runs the algorithm 10 times with different centroid seeds
# The final result will be the best output in terms of inertia
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
# Fit the model to the scaled data
kmeans.fit(X_scaled)
# Get the cluster assignments and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# --- Visualization using Plotly ---
# Create a scatter plot for the data points, colored by predicted cluster label
fig = go.Figure()
# Add data points
fig.add_trace(go.Scatter(
    x=X_scaled[:, 0], y=X_scaled[:, 1],
    mode='markers',
    marker=dict(
        color=labels,  # Assign color based on cluster labels
        colorscale=['#1f77b4', '#ff7f0e', '#2ca02c'],  # Example colors
        opacity=0.8,
        line=dict(width=0.5, color='#495057')
    ),
    name='Data Points'
))
# Add centroids
fig.add_trace(go.Scatter(
    x=centroids[:, 0], y=centroids[:, 1],
    mode='markers',
    marker=dict(
        color='#d62728',  # Red color for centroids
        size=12,
        symbol='x',
        line=dict(width=2, color='#FFFFFF')  # White border for visibility
    ),
    name='Centroids'
))
# Update layout for better appearance
fig.update_layout(
    title='K-Means Clustering Results (k=3)',
    xaxis_title='Feature 1 (Scaled)',
    yaxis_title='Feature 2 (Scaled)',
    showlegend=True,
    plot_bgcolor='#e9ecef',  # Light gray background
    width=700,
    height=500
)
# To display the plot (in a Jupyter environment, for example):
# fig.show()
# Or save to HTML:
# fig.write_html("kmeans_clusters.html")
# The Plotly JSON object (for embedding in web contexts)
plotly_json = fig.to_json()
The scatter plot shows the data points colored according to the cluster assigned by K-Means. The red 'x' markers indicate the final positions of the cluster centroids.
In the code:

- n_clusters=3: We tell K-Means to find 3 clusters.
- init='k-means++': Uses the k-means++ initialization method for better centroid starting points.
- n_init=10: Runs the entire K-Means algorithm 10 times with different random initializations and returns the best result based on WCSS (inertia). This helps mitigate the issue of getting stuck in a suboptimal solution due to poor initialization.
- random_state=42: Ensures reproducibility of results.
- .fit(X_scaled): Trains the K-Means model on the scaled data.
- .labels_: An array containing the cluster index (0, 1, or 2 in this case) assigned to each data point.
- .cluster_centers_: A NumPy array containing the coordinates of the final centroids.

A significant challenge with K-Means is that you usually don't know the optimal number of clusters (k) beforehand. The Elbow Method is a common heuristic used to help select a good value for k.
The idea is to run K-Means for a range of different k values (e.g., from 1 to 10) and calculate the WCSS (inertia) for each run. We then plot WCSS against k. As k increases, the WCSS will generally decrease because points will be closer to the centroids of smaller clusters. However, the rate of decrease usually slows down significantly after a certain point, forming an "elbow" shape in the plot. The value of k at this elbow point is often considered a good indicator of the appropriate number of clusters.
Here's how to implement the Elbow Method:
import plotly.graph_objects as go
# Calculate WCSS (inertia) for different values of k
wcss = []
k_values = range(1, 11) # Test k from 1 to 10
for k in k_values:
    kmeans_elbow = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    kmeans_elbow.fit(X_scaled)
    wcss.append(kmeans_elbow.inertia_)  # The inertia_ attribute stores the WCSS
# --- Visualization using Plotly ---
fig_elbow = go.Figure()
fig_elbow.add_trace(go.Scatter(
    x=list(k_values),
    y=wcss,
    mode='lines+markers',
    marker=dict(color='#1c7ed6', size=8),
    line=dict(color='#1c7ed6', width=2)
))
# Highlight the potential elbow point (e.g., k=3 in this case)
elbow_k = 3
fig_elbow.add_vline(x=elbow_k, line_width=2, line_dash="dash", line_color="#fa5252",
                    annotation_text=f"Potential Elbow (k={elbow_k})", annotation_position="top right")
fig_elbow.update_layout(
    title='Elbow Method for Optimal k',
    xaxis_title='Number of Clusters (k)',
    yaxis_title='Within-Cluster Sum of Squares (WCSS)',
    plot_bgcolor='#e9ecef',
    width=700,
    height=400
)
# fig_elbow.show()
plotly_elbow_json = fig_elbow.to_json()
Plot showing WCSS against the number of clusters (k). The "elbow" typically indicates a reasonable trade-off between minimizing WCSS and keeping the number of clusters manageable. In this synthetic example, the elbow is clearly visible at k=3, matching the true number of clusters we generated.
Finding the exact elbow point can sometimes be subjective, especially with real-world data where the curve might be smoother. It serves as a guideline rather than a definitive rule. Other techniques, like the Silhouette Score, provide alternative ways to evaluate cluster quality and select k.
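As a brief illustration of the silhouette approach, the sketch below computes the average silhouette score for several candidate values of k using scikit-learn's silhouette_score; higher values indicate better-separated clusters. The range of k tested here is an assumption for this example.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette analysis: a higher average score means better-defined clusters.
# The score is undefined for k=1, so we start at k=2.
for k in range(2, 7):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    cluster_labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, cluster_labels)
    print(f"k={k}: average silhouette score = {score:.3f}")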
K-Means relies on distance calculations (typically Euclidean) to assign points to clusters and update centroids. If features have vastly different scales (e.g., one feature ranges from 0 to 1, another from 1,000 to 100,000), the feature with the larger range will disproportionately influence the distance calculation and, consequently, the clustering outcome.
Therefore, it's standard practice to scale or normalize your numerical features before applying K-Means. Techniques like StandardScaler (which standardizes features to have zero mean and unit variance, as used in the code example) or MinMaxScaler (which scales features to a fixed range, typically [0, 1]) are commonly used. This ensures all features contribute more equally to the distance computations. Remember to fit the scaler on your training data and use it to transform both training and any new data before clustering.
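As a sketch of that workflow, the snippet below reuses the scaler and kmeans objects fitted in the earlier example to assign cluster labels to new, previously unseen points; the new_points values are made up purely for illustration.
import numpy as np

# A couple of hypothetical new observations, expressed in the original feature units
new_points = np.array([[1.5, 2.0],
                       [-6.0, -7.5]])

# Reuse the scaler fitted on the training data; do NOT fit it again on new data
new_points_scaled = scaler.transform(new_points)

# Assign each new point to the nearest learned centroid
new_labels = kmeans.predict(new_points_scaled)
print(new_labels)  # e.g. an array of cluster indices such as [0, 2]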
While powerful and efficient, K-Means operates under certain assumptions that can limit its effectiveness in some scenarios:

- You must choose the number of clusters (k) in advance; heuristics like the Elbow Method only provide guidance.
- It assumes clusters are roughly spherical and of similar size and density, since each point is assigned based on its distance to a single centroid.
- The result depends on where the centroids start; running the algorithm several times with different initializations (the n_init parameter) helps find a more stable solution.

Understanding these limitations is important for interpreting results and deciding if K-Means is the right choice for your specific dataset and problem. For datasets with non-spherical clusters or varying densities, other algorithms like DBSCAN (discussed next) might be more suitable.
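To see the spherical-cluster assumption break down, here is a small sketch that runs K-Means on two interleaving half-moons generated with scikit-learn's make_moons, a common example of non-spherical structure. Comparing the predicted labels with the true ones shows that many points end up in the wrong group; the sample size and noise level are assumptions for this example.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import numpy as np

# Two crescent-shaped clusters that cannot be separated by distance to a single centroid each
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

km_moons = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
moon_labels = km_moons.fit_predict(X_moons)

# Agreement with the true grouping (cluster label order is arbitrary, so check both mappings)
agreement = max(np.mean(moon_labels == y_moons), np.mean(moon_labels != y_moons))
print(f"Fraction of points matching the true moons: {agreement:.2f}")
# Typically well below 1.0, because K-Means cuts the moons with a roughly straight boundary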