While K-Means clustering, as discussed previously, is effective for identifying spherical clusters when the number of clusters k is known, many datasets feature clusters of arbitrary shapes and may contain outliers that don't belong to any specific group. Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, offers an alternative approach that excels in these scenarios. It groups together points that are closely packed, marking points that lie alone in low-density regions as outliers. A significant advantage of DBSCAN is that it doesn't require you to specify the number of clusters beforehand.
DBSCAN's logic revolves around two main parameters:
- Epsilon (eps): This parameter defines the radius of a neighborhood around a data point. If you imagine drawing a circle of radius ϵ around a point, all other points falling within this circle are considered its neighbors.
- Minimum points (min_pts): This is the minimum number of data points (including the point itself) that must be present within a point's ϵ-neighborhood for that point to be considered a "core point."
Based on these parameters, DBSCAN categorizes each point in the dataset into one of three types:
- Core point: A point that has at least min_pts points (including itself) within its ϵ-neighborhood. Core points are typically found in the dense interior of a cluster.
- Border point: A point that is not itself a core point (it has fewer than min_pts points in its ϵ-neighborhood) but falls within the ϵ-neighborhood of a core point. Border points lie on the fringes of a cluster.
- Noise point: A point that is neither a core point nor a border point; it lies alone in a low-density region.
Clusters are then formed by connecting core points that are close to each other. Specifically:
A cluster in DBSCAN is a maximal set of density-connected points: two points are density-connected if both can be reached from a common core point through a chain of core points, each lying within ϵ of the next. The algorithm iterates through the data points, and when it encounters an unvisited core point, it expands a new cluster from it by finding all density-reachable points. Points classified as noise are not assigned to any cluster.
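To make these definitions concrete, here is a minimal brute-force sketch (not using Clustering.jl) that labels each column of a small data matrix as a core, border, or noise point. The function classify_points and the toy matrix pts are illustrative names introduced only for this example.
# Minimal illustration: classify each column of a d x n matrix as :core, :border, or :noise
using LinearAlgebra
function classify_points(pts::AbstractMatrix, eps::Real, min_pts::Int)
    n = size(pts, 2)
    # For each point, collect the indices of all points within distance eps (itself included)
    neighborhoods = [findall(j -> norm(pts[:, i] .- pts[:, j]) <= eps, 1:n) for i in 1:n]
    is_core = [length(neighborhoods[i]) >= min_pts for i in 1:n]
    return map(1:n) do i
        is_core[i] && return :core
        # Border: not a core point itself, but inside some core point's eps-neighborhood
        any(j -> is_core[j] && (i in neighborhoods[j]), 1:n) && return :border
        return :noise
    end
end
# Tiny example: a dense clump of four points plus one isolated point
pts = [0.0 0.1 0.2 0.1 5.0;
       0.0 0.1 0.0 0.2 5.0]
println(classify_points(pts, 0.3, 3))  # the isolated point should come out as :noise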
In Julia, the Clustering.jl package provides an efficient implementation of DBSCAN. Let's see how to use it. First, ensure you have the package installed:
using Pkg
Pkg.add("Clustering")
Pkg.add("Plots") # For visualization
Now, let's generate some sample data that forms non-spherical clusters, which is where DBSCAN shines, and apply the algorithm.
using Clustering
using Plots
using Random # For reproducible example
# Generate sample data: two moons
Random.seed!(123) # for reproducibility
n_points_per_moon = 150
noise_level = 0.05
# First moon
t1 = range(0, pi, length=n_points_per_moon)
x1 = cos.(t1) .+ Random.randn(n_points_per_moon) .* noise_level
y1 = sin.(t1) .+ Random.randn(n_points_per_moon) .* noise_level
# Second moon (offset and flipped)
t2 = range(0, pi, length=n_points_per_moon)
x2 = 1 .- cos.(t2) .+ Random.randn(n_points_per_moon) .* noise_level
y2 = 0.5 .- sin.(t2) .- Random.randn(n_points_per_moon) .* noise_level
# Combine into a single 2 x N data matrix (features as rows, samples as columns)
data = vcat(hcat(x1, y1), hcat(x2, y2))'  # (300 x 2)' -> 2 x 300
# Add some random noise points
n_noise = 20
noise_x = (rand(n_noise) .* 3 .- 0.5)
noise_y = (rand(n_noise) .* 2 .- 0.5)
noise_data = hcat(noise_x, noise_y)'
full_data = hcat(data, noise_data) # Data matrix: 2 rows (features), N columns (samples)
# Apply DBSCAN
# Parameters: eps (radius), min_pts (minimum points in neighborhood)
# Clustering.jl's dbscan expects a d x n matrix, where d is the number of dimensions
# and n is the number of points. Our full_data already has this layout
# (2 rows of features, 320 columns of samples), so it can be passed directly.
eps_val = 0.25
min_pts_val = 5
result = dbscan(full_data, eps_val, min_neighbors=min_pts_val)
# `result.assignments` contains the cluster assignment for each point.
# 0 usually denotes noise points.
# `result.counts` gives the number of points in each cluster.
# Each cluster in `result.clusters` additionally records its core and boundary point indices.
println("Cluster assignments: ", result.assignments)
println("Cluster counts: ", result.counts)
# Visualize the results
# Map cluster assignments to colors. Noise points (assignment 0) get a specific color.
cluster_colors = [ :gray, :blue, :red, :green, :purple, :orange ] # Add more if expecting more clusters
point_colors = map(id -> id == 0 ? cluster_colors[1] : cluster_colors[id+1], result.assignments)
point_markers = map(id -> id == 0 ? :x : :circle, result.assignments)
scatter(full_data[1, :], full_data[2, :],
markercolor=point_colors,
markershape=point_markers,
legend=false,
title="DBSCAN Clustering Results (eps=$eps_val, min_pts=$min_pts_val)",
xlabel="Feature 1", ylabel="Feature 2")
Running this code will apply DBSCAN to the generated dataset. The dbscan function from Clustering.jl takes the data matrix (features as rows, samples as columns), ϵ, and min_neighbors (which corresponds to min_pts) as primary arguments. The output result object contains assignments (an array where each element is the cluster ID for the corresponding data point, with 0 indicating noise), counts (the number of points in each cluster), and clusters (a vector of DbscanCluster objects, each recording the core_indices and boundary_indices of its points).
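As a quick follow-up, a short summary loop such as the one below can help interpret the result object. This sketch assumes the result and full_data produced above, and a recent Clustering.jl version in which dbscan returns a DbscanResult.
# Summarize the clustering result
n_clusters = length(result.counts)
n_noise = count(==(0), result.assignments)
println("Clusters found: ", n_clusters)
println("Points labeled as noise: ", n_noise)
for (i, c) in enumerate(result.clusters)
    println("Cluster $i: $(length(c.core_indices)) core points, ",
            "$(length(c.boundary_indices)) boundary points")
end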
Choosing eps and min_pts
The performance of DBSCAN is quite sensitive to the choice of eps and min_pts.
- min_pts: A general guideline for min_pts is to set it based on the dimensionality of your data (D). A common starting point is min_pts >= D + 1. For 2-dimensional data, min_pts = 4 or min_pts = 5 is often a good choice. Larger values of min_pts tend to produce more significant clusters and classify more points as noise, making the algorithm more resilient to outliers. If min_pts is too small (e.g., 1 or 2), every point could be a core point, or sparse clusters might merge.
- eps: Selecting eps is often more challenging. A common heuristic is the k-distance plot: for each point, compute the distance to its k-th nearest neighbor (with k = min_pts - 1), sort these distances in ascending order, and plot them. The distance at the "elbow" of this curve is a reasonable candidate for eps. This point represents a threshold where a sharp increase in distance occurs, suggesting a transition from denser areas to sparser areas.
# Example for k-distance plot (using k = min_pts_val - 1)
# This requires a way to compute k-nearest neighbor distances.
# We can use NearestNeighbors.jl for this.
# Pkg.add("NearestNeighbors")
using NearestNeighbors
k = min_pts_val - 1
kdtree = KDTree(full_data) # Build a KD-tree for efficient neighbor search
# `knn` returns indices and distances. We need the distances.
# For each point, find its k nearest neighbors (excluding itself).
# Since knn includes the point itself if it's in the tree, we ask for k+1 neighbors
# and take the k-th distance if k > 0, or handle k=0 if min_pts=1.
if k > 0
    idxs, dists = knn(kdtree, full_data, k + 1, true) # true to sort results by distance
    # dists is a vector of vectors; d[1] is the point itself (distance 0),
    # so the distance to the k-th *other* point is at index k + 1
    k_distances = [d[k+1] for d in dists]
    sort!(k_distances)
    plot(k_distances,
         xlabel="Points (sorted by distance)",
         ylabel="$k-th Nearest Neighbor Distance",
         title="k-Distance Plot (k=$k)",
         legend=false)
else
    println("k must be > 0 for k-distance plot based on min_pts - 1.")
end
You would then look for an "elbow" in the generated k-distance plot to estimate eps. Domain knowledge about which distances are meaningful for your data can also inform your eps choice. Ultimately, selecting eps involves some experimentation. Try a few values and evaluate the quality of the resulting clusters, perhaps visually or using cluster evaluation metrics if ground truth is unavailable.
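For instance, a rough sketch of such an experiment, reusing full_data and min_pts_val from the earlier example with arbitrarily chosen candidate values, might look like this:
# Try a few candidate eps values and compare the resulting cluster structure
for candidate_eps in (0.15, 0.25, 0.35)
    r = dbscan(full_data, candidate_eps, min_neighbors=min_pts_val)
    println("eps = $candidate_eps: ", length(r.counts), " clusters, ",
            count(==(0), r.assignments), " noise points")
end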
Advantages:
- Does not require the number of clusters to be specified in advance.
- Can discover clusters of arbitrary shape, not just spherical ones.
- Handles outliers naturally: points in low-density regions are labeled as noise rather than forced into a cluster.
Disadvantages:
- Performance is sensitive to the eps and min_pts parameters. Finding optimal values can be non-trivial.
- Clusters of widely varying density are difficult: a single eps and min_pts setting might not be appropriate for all clusters. Some dense clusters might be merged, or sparser clusters might be missed or broken up.
Visualizing the output is a great way to understand how DBSCAN has partitioned your data. For 2D or 3D data, scatter plots are very effective.
Example of DBSCAN identifying two crescent-shaped clusters and marking some points as noise.
DBSCAN provides a powerful, density-based approach to clustering, particularly useful when dealing with complex data structures and the presence of noise. Its ability to discover clusters of varying shapes without prior specification of their count makes it a valuable tool in the unsupervised learning toolkit. However, careful consideration of its parameters is important for achieving meaningful results.