While K-Means is effective for finding spherical or convex-shaped clusters when the number of clusters is known beforehand, it struggles with datasets containing arbitrarily shaped groupings or significant noise. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) offers a different approach, defining clusters as continuous regions of high point density, separated by regions of low density. This makes it adept at identifying clusters of complex shapes and automatically handling outliers as noise.
Imagine points scattered on a map. Clusters can be thought of as crowded areas (high density), while the spaces between them are relatively empty (low density). DBSCAN formalizes this idea using two main parameters:
- eps: Defines a radius around each data point. The eps-neighborhood of a point consists of all other points within this distance.
- min_samples: Specifies the minimum number of points required within a point's eps-neighborhood (including the point itself) for it to be considered a "core point" – essentially, a point in a dense region.

Based on these parameters, DBSCAN classifies each point into one of three types:

- Core point: Has at least min_samples points within its eps-neighborhood. These points are the heart of a cluster.
- Border point: Falls within the eps-neighborhood of a core point but does not have min_samples points within its own neighborhood. Border points lie on the edges of clusters.
- Noise point: Is neither a core point nor a border point.

Clusters are formed by connecting core points that are neighbors (within eps distance of each other). Border points are then assigned to the cluster of a nearby core point. Any point not reachable by following these connections from a core point is labeled as noise. A significant advantage here is that DBSCAN doesn't force every point into a cluster; it explicitly identifies noise.
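To make these definitions concrete, here is a minimal sketch (not scikit-learn's implementation) that classifies a handful of 2D points as core, border, or noise using brute-force distance checks; the toy coordinates and parameter values are purely illustrative:

import numpy as np

# Toy points chosen so that each category appears (values are illustrative only)
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # a tight group
    [0.35, 0.0],                                      # hangs off the edge of the group
    [2.0, 2.0],                                       # far from everything
])
eps, min_samples = 0.3, 4

# Brute-force pairwise distances (fine for a handful of points)
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Core points have at least min_samples points (including themselves) within eps
neighbor_counts = (dists <= eps).sum(axis=1)
is_core = neighbor_counts >= min_samples

# Border points are not core but lie within eps of at least one core point
is_border = ~is_core & ((dists <= eps) & is_core[None, :]).any(axis=1)

# Everything else is noise
is_noise = ~is_core & ~is_border

for point, core, border in zip(points, is_core, is_border):
    kind = 'core' if core else ('border' if border else 'noise')
    print(point, kind)
# Expected: the four grouped points are core, [0.35, 0.0] is border, [2.0, 2.0] is noise

Scikit-learn's DBSCAN uses efficient neighbor searches rather than a full distance matrix, but the core/border/noise classification logic is the same.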
Scikit-learn provides a straightforward implementation of DBSCAN. Let's see how it works on a dataset where K-Means might perform poorly, like the "two moons" dataset.
First, we generate the data and visualize it:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import seaborn as sns
# Generate sample data
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
# Scale the data (important for distance-based algorithms like DBSCAN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Visualize the original scaled data
plt.figure(figsize=(8, 5))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], s=50, alpha=0.7)
plt.title('Scaled Two Moons Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Now, we apply DBSCAN. Choosing eps and min_samples is important, and we'll discuss strategies for this shortly. For now, let's try some plausible values:
# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X_scaled) # Use fit_predict for convenience
# Get core sample indices and labels
core_samples_mask = np.zeros_like(dbscan.labels_, dtype=bool)
core_samples_mask[dbscan.core_sample_indices_] = True
labels = dbscan.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print(f'Estimated number of clusters: {n_clusters_}')
print(f'Estimated number of noise points: {n_noise_}')
# Visualize the clustering results
plt.figure(figsize=(8, 5))
# Plot non-noise points
unique_labels = set(labels)
colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Use black for noise points
        col = [0, 0, 0, 1]
        marker = 'x'
        markersize = 5
        label = 'Noise'
    else:
        marker = 'o'
        markersize = 7
        label = f'Cluster {k}'
    class_member_mask = (labels == k)
    # Plot core samples
    xy = X_scaled[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], marker, markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=markersize, label=label if k != -1 else "")
    # Plot border samples (non-core)
    xy = X_scaled[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], marker, markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=markersize * 0.6)  # Smaller size for border points
# Add noise legend entry separately if noise exists
if n_noise_ > 0:
    plt.plot([], [], 'kx', markersize=5, label='Noise')
plt.title('DBSCAN Clustering (eps=0.3, min_samples=5)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(loc='best')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Visualization of DBSCAN clustering on the scaled 'two moons' dataset. Core points are shown with black edges, border points are smaller, and noise points are marked with 'x'. DBSCAN successfully identifies the non-convex shapes.
The labels_ attribute contains the cluster assignment for each point; scikit-learn uses -1 to denote noise points. The core_sample_indices_ attribute gives the indices of the points identified as core points.
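As a quick illustration of these attributes, the following sketch reuses the fitted dbscan object to summarize each cluster by its core and border membership; the exact counts depend on the generated data:

# Summarize cluster composition from labels_ and core_sample_indices_
summary_labels = dbscan.labels_
summary_core_mask = np.zeros_like(summary_labels, dtype=bool)
summary_core_mask[dbscan.core_sample_indices_] = True

for k in sorted(set(summary_labels)):
    members = summary_labels == k
    if k == -1:
        print(f'Noise: {members.sum()} points')
    else:
        n_core = (members & summary_core_mask).sum()
        n_border = (members & ~summary_core_mask).sum()
        print(f'Cluster {k}: {members.sum()} points ({n_core} core, {n_border} border)')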
Choosing eps and min_samples

The performance of DBSCAN hinges on selecting appropriate values for eps and min_samples.
- min_samples: This parameter influences the minimum density required to form a cluster. A common heuristic is to choose min_samples based on the dimensionality (D) of the data, such as min_samples >= D + 1 or min_samples >= 2 * D. For 2D data, a min_samples between 3 and 5 is often a good starting point.
- eps: This parameter determines the neighborhood size, and choosing it well is often more challenging. A widely used heuristic is the k-distance graph: for each point, compute the distance to its k-th nearest neighbor, where k = min_samples - 1, sort these distances in ascending order, and plot them. The "elbow" of the resulting curve suggests a candidate value for eps. It represents a threshold where points start becoming significantly further away from their neighbors, potentially indicating the boundary between dense regions and sparser regions or noise.

Let's implement the k-distance graph approach:
from sklearn.neighbors import NearestNeighbors
# Use k = min_samples - 1 for the k-distance graph (min_samples = 5 above, so k = 4)
min_samples_knn = 4
# Calculate distances to the k-th nearest neighbor
nn = NearestNeighbors(n_neighbors=min_samples_knn + 1)  # k+1 because each point's nearest neighbor is itself
nn.fit(X_scaled)
distances, indices = nn.kneighbors(X_scaled)
# Get the distance to the k-th neighbor (index k)
k_distances = np.sort(distances[:, min_samples_knn], axis=0)
# Plot the k-distance graph
plt.figure(figsize=(8, 5))
plt.plot(k_distances)
plt.title(f'K-Distance Graph (k={min_samples_knn})')
plt.xlabel('Points sorted by distance')
plt.ylabel(f'{min_samples_knn}-th Nearest Neighbor Distance')
plt.grid(True, linestyle='--', alpha=0.6)
# Optional: Add a line indicating a potential elbow (adjust value based on plot)
elbow_eps = 0.3
plt.axhline(y=elbow_eps, color='r', linestyle='--', label=f'Elbow at eps={elbow_eps}')
plt.legend()
plt.show()
K-distance graph for the scaled 'two moons' dataset with k=4. The y-axis shows the distance to the 4th nearest neighbor for each point, sorted in ascending order. The 'elbow' around a distance of 0.3 suggests a suitable value for eps.
The plot shows a distinct elbow. The distance value at this point (around 0.3 in this example) suggests a radius where the density noticeably decreases, making it a good candidate for eps. Keep in mind that this is a heuristic; you might need to experiment with values around the elbow and evaluate the resulting clusters visually or using cluster validation metrics (if applicable in an unsupervised context).
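One way to experiment, sketched below, is to sweep a few candidate eps values around the elbow and compare the cluster count, noise count, and a silhouette score computed on the non-noise points (with the usual caveat that silhouette scores favor compact, convex clusters); the candidate values here are illustrative:

from sklearn.metrics import silhouette_score

for eps_candidate in [0.2, 0.25, 0.3, 0.35, 0.4]:
    candidate_labels = DBSCAN(eps=eps_candidate, min_samples=5).fit_predict(X_scaled)
    n_clusters = len(set(candidate_labels)) - (1 if -1 in candidate_labels else 0)
    n_noise = int((candidate_labels == -1).sum())
    # Silhouette requires at least two clusters; evaluate only the non-noise points
    non_noise = candidate_labels != -1
    if n_clusters >= 2:
        score_text = f'{silhouette_score(X_scaled[non_noise], candidate_labels[non_noise]):.3f}'
    else:
        score_text = 'n/a'
    print(f'eps={eps_candidate:.2f}: clusters={n_clusters}, noise={n_noise}, silhouette={score_text}')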
Advantages and Disadvantages

Advantages:

- Finds clusters of arbitrary shape, not just spherical or convex ones.
- Does not require the number of clusters to be specified in advance.
- Explicitly identifies noise points rather than forcing every point into a cluster.

Disadvantages:

- Sensitive to the choice of eps and min_samples. The k-distance graph helps, but tuning might still be required.
- Struggles when clusters have widely varying densities, because eps and min_samples are global parameters; points in less dense clusters might be marked as noise. Variants like OPTICS or HDBSCAN address this limitation (see the short sketch below).

DBSCAN is particularly well-suited for datasets with arbitrarily shaped clusters, significant noise or outliers, and situations where the number of clusters is not known beforehand.
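For data where cluster density varies noticeably, the variants mentioned above are worth trying. As a minimal sketch, scikit-learn's OPTICS can be applied to the same scaled data; the parameter value here is only an illustrative starting point, and recent scikit-learn versions (1.3+) also provide an HDBSCAN estimator with a similar interface:

from sklearn.cluster import OPTICS

# OPTICS builds a reachability ordering instead of relying on a single global eps,
# which lets it extract clusters of varying density from one run
optics = OPTICS(min_samples=5)
optics_labels = optics.fit_predict(X_scaled)

print('OPTICS clusters found:',
      len(set(optics_labels)) - (1 if -1 in optics_labels else 0))
print('OPTICS noise points:', int((optics_labels == -1).sum()))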
By understanding the concepts of density, core points, and reachability, and by using techniques like the k-distance graph to guide parameter selection, you can effectively apply DBSCAN to uncover meaningful patterns in complex, unlabeled datasets where simpler algorithms might fall short.