This section guides you through applying clustering algorithms and Principal Component Analysis (PCA) using Julia. We'll put theory into practice by working with a common dataset, performing K-Means clustering, reducing data dimensionality with PCA, and then evaluating the quality of our clustering results. This hands-on experience will solidify your understanding of how these unsupervised learning techniques are implemented and used in a typical machine learning workflow.
First, ensure you have the necessary Julia packages installed. If not, you can add them using Julia's package manager:
using Pkg
Pkg.add(["MLJ", "MLJModels", "DataFrames", "RDatasets", "Plots", "Clustering", "Distances", "MultivariateStats", "Tables", "Random", "Statistics", "Printf"])
Once installed, we can load them into our Julia session.
using MLJ
using DataFrames
using RDatasets
using Plots
using Clustering # For silhouette scores
using Distances # For distance matrix with silhouette scores
using MultivariateStats # For direct PCA if not using MLJ's wrapper
using Random
using Printf
using Statistics # For mean
using Tables # For generic access to the columns of MLJ's table output
# Set a random seed for reproducibility
Random.seed!(1234)
# Load specific models from MLJModels
KMeans = @load KMeans pkg=Clustering verbosity=0
PCA = @load PCA pkg=MultivariateStats verbosity=0
We use Random.seed!(1234) to ensure that operations involving randomness, like K-Means initialization, produce the same results each time we run the code. The @load macro from MLJ imports the model types, and verbosity=0 suppresses output during loading.
We'll use the classic Iris dataset, which is readily available through RDatasets.jl. This dataset contains measurements for 150 iris flowers, each belonging to one of three species. For our unsupervised learning tasks, we will use only the feature measurements and ignore the species labels, pretending we don't know them.
# Load the Iris dataset
iris = RDatasets.dataset("datasets", "iris")
# Separate features (X) from the dataset. We'll ignore Species for unsupervised tasks.
X = select(iris, Not(:Species))
# Display the first few rows and a summary of the features
println("First 5 rows of features:")
show(stdout, "text/plain", first(X, 5))
println("\n\nSummary of features:")
show(stdout, "text/plain", describe(X))
The output will show four features: SepalLength, SepalWidth, PetalLength, and PetalWidth. These are the attributes we'll use for clustering and dimensionality reduction.
K-Means is an algorithm that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centroid). Let's apply K-Means to our 4-dimensional Iris feature data. We'll choose k=3 clusters, as we (externally) know there are three species of Iris flowers.
# Instantiate the K-Means model, specifying 3 clusters
kmeans_model = KMeans(k=3)
# Create an MLJ machine binding the model to the data
mach_kmeans = machine(kmeans_model, X)
# Fit the K-Means model
fit!(mach_kmeans)
# Get the cluster assignments for each data point
assignments = predict(mach_kmeans, X)
# MLJ's predict returns a CategoricalArray. For some uses, we might want integer labels.
assignments_int = MLJ.levelcode.(assignments)
# Get the cluster centroids
centroids = report(mach_kmeans).centroids
println("\n\nCluster Centroids (4D):")
show(stdout, "text/plain", centroids)
The report(mach_kmeans).centroids call gives us the coordinates of the three cluster centers in the original 4-dimensional space. Each row in centroids represents one center.
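A quick sanity check on the result is to count how many observations fell into each cluster; this uses only base Julia:
# Count how many points were assigned to each of the three clusters
cluster_sizes = [count(==(c), assignments_int) for c in sort(unique(assignments_int))]
println("Points per cluster: ", cluster_sizes)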
Visualizing clusters in 4D directly is not feasible. However, we can create a scatter plot of two features, say PetalLength and PetalWidth, and color the points by their assigned cluster.
# Visualize clusters using two features: PetalLength vs PetalWidth
scatter(X.PetalLength, X.PetalWidth, group=assignments,
xlabel="Petal Length", ylabel="Petal Width",
title="K-Means Clustering on Iris (Petal Features)",
legend=:topleft, palette=[:blue, :green, :orange]) # Using palette for distinct colors
A scatter plot showing Iris data points based on Petal Length and Petal Width, colored by their K-Means assigned cluster.
This plot gives a partial view of the clustering. Points with the same color are grouped together by K-Means.
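You can also overlay the fitted centers on the same plot. The snippet below assumes centroids has one row per center with columns ordered like the original features, so that columns 3 and 4 hold the PetalLength and PetalWidth coordinates; if your version of the interface returns the transposed layout, swap the indexing accordingly:
# Overlay the petal coordinates of the 4D centroids on the existing scatter plot
# (assumes one centroid per row, feature order as in X)
scatter!(centroids[:, 3], centroids[:, 4],
         marker=:star5, markersize=12, color=:red, label="Centroids")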
Our Iris dataset has four features. PCA can help us reduce this to a lower number of dimensions (e.g., two) while retaining most of the variance in the data. This is useful for visualization and can sometimes improve the performance of subsequent machine learning algorithms.
# Instantiate PCA model to reduce to 2 dimensions
pca_model = PCA(maxoutdim=2) # Or pratio=0.95 to retain 95% variance
# Create an MLJ machine for PCA
mach_pca = machine(pca_model, X)
# Fit PCA model
fit!(mach_pca)
# Transform the original data to its 2 principal components
X_pca = transform(mach_pca, X) # X_pca will be a table
# Display the first few rows of the PCA-transformed data
println("\n\nFirst 5 rows of PCA-transformed data (2D):")
show(stdout, "text/plain", first(X_pca, 5))
# The report contains information like explained variance
pca_report = report(mach_pca)
@printf("\nCumulative variance explained by 2 components: %.2f%%\n", pca_report.cumulative_variance[2] * 100)
The transform function projects the original 4D data onto the first two principal components. The report from the PCA machine tells us how much of the original data's variance is captured by these two components. Typically, a high percentage (e.g., above 80-90%) indicates a good reduction.
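Because we loaded MultivariateStats.jl directly, we can cross-check this variance figure outside of MLJ. The sketch below relies on MultivariateStats' convention that observations are stored as columns and on its principalratio accessor, which reports the fraction of total variance retained by the kept components:
# Cross-check: fit PCA directly with MultivariateStats (observations as columns)
Xm = permutedims(MLJ.matrix(X))            # 4×150 matrix
pca_direct = MultivariateStats.fit(MultivariateStats.PCA, Xm; maxoutdim=2)
@printf("Variance retained (direct MultivariateStats fit): %.2f%%\n",
        principalratio(pca_direct) * 100)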
Now, let's visualize the data in this new 2D space derived from PCA.
# X_pca is a table whose columns have generic names (x1, x2, ...).
# Access them through the Tables.jl interface for plotting with Plots.jl.
col_names = Tables.columnnames(X_pca)
scatter(Tables.getcolumn(X_pca, col_names[1]), Tables.getcolumn(X_pca, col_names[2]),
xlabel="Principal Component 1", ylabel="Principal Component 2",
title="Iris Data after PCA (2D)", legend=false,
color=:gray, markersize=4)
Data points from the Iris dataset after being transformed by PCA into a two-dimensional space. Each point represents an iris flower.
With our data reduced to two dimensions, we can now apply K-Means clustering again. This is common practice: reduce dimensions, then cluster.
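As an aside, MLJ can compose these two steps into a single pipeline, so that the PCA projection and the clustering are fitted and applied together. The sketch below assumes MLJ's model-composition syntax (model1 |> model2) and that predict on the composite machine forwards to the final K-Means step; behaviour can vary across MLJ versions, so the rest of this section keeps the explicit two-machine approach:
# Sketch: compose dimensionality reduction and clustering into one MLJ pipeline
pca_then_kmeans = PCA(maxoutdim=2) |> KMeans(k=3)
mach_pipe = machine(pca_then_kmeans, X)
fit!(mach_pipe)
assignments_pipe = predict(mach_pipe, X)  # cluster labels from the composite model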
# Apply K-Means (k=3) on the 2D PCA-transformed data
mach_kmeans_pca = machine(kmeans_model, X_pca) # kmeans_model is already KMeans(k=3)
fit!(mach_kmeans_pca)
# Get cluster assignments on PCA data
assignments_pca = predict(mach_kmeans_pca, X_pca)
assignments_pca_int = MLJ.levelcode.(assignments_pca)
# Get centroids in the 2D PCA space
centroids_pca = report(mach_kmeans_pca).centroids
println("\n\nCluster Centroids (2D PCA space):")
show(stdout, "text/plain", centroids_pca)
# Visualize clusters on the 2D PCA plot
scatter(Tables.getcolumn(X_pca, col_names[1]), Tables.getcolumn(X_pca, col_names[2]), group=assignments_pca,
xlabel="Principal Component 1", ylabel="Principal Component 2",
title="K-Means Clustering on PCA-reduced Iris Data",
legend=:topleft, palette=[:blue, :green, :orange]) # Match palette if comparing
PCA-transformed Iris data points colored according to the cluster assignments from K-Means applied to this 2D data. Distinct colors represent different clusters.
This plot should show clearer visual separation if the clusters align well with the principal components.
How good are our clusters? For unsupervised learning, evaluation can be tricky since we don't have ground truth labels (though for Iris, we secretly do). Internal evaluation metrics assess the quality of the clustering structure itself. The Silhouette Score is a popular one. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Scores range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
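Concretely, for a point i, let a(i) be its mean distance to the other members of its own cluster and b(i) the smallest mean distance to the members of any other cluster; the silhouette is then (b(i) - a(i)) / max(a(i), b(i)). As a sketch of that definition only (the helper name silhouette_point is ours, and it assumes every cluster has at least two members):
# From-scratch silhouette for a single point i, given integer cluster
# assignments and a pairwise distance matrix D. For intuition only; we use
# Clustering.jl's silhouettes function below for the actual computation.
function silhouette_point(i, assignments, D)
    own = assignments[i]
    others = [j for j in eachindex(assignments) if j != i]
    a = mean(D[i, j] for j in others if assignments[j] == own)        # cohesion
    b = minimum(mean(D[i, j] for j in others if assignments[j] == c)  # separation
                for c in unique(assignments) if c != own)
    return (b - a) / max(a, b)
end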
We will use the silhouettes function from Clustering.jl. It takes integer cluster assignments and a matrix of pairwise distances, which we compute from the feature data in matrix form.
# Convert features to matrix format for distance calculation
X_matrix = MLJ.matrix(X) # Original 4D data
X_pca_matrix = MLJ.matrix(X_pca) # PCA-reduced 2D data
# Calculate Silhouette Score for K-Means on original 4D data
# assignments_int are from K-Means on original X
dist_matrix_4d = pairwise(Euclidean(), X_matrix', dims=2)
sils_4d = silhouettes(assignments_int, dist_matrix_4d)
mean_silhouette_4d = mean(sils_4d)
@printf("\nMean Silhouette Score (K-Means on 4D data): %.3f\n", mean_silhouette_4d)
# Calculate Silhouette Score for K-Means on PCA-reduced 2D data
# assignments_pca_int are from K-Means on X_pca
dist_matrix_2d = pairwise(Euclidean(), X_pca_matrix', dims=2)
sils_2d = silhouettes(assignments_pca_int, dist_matrix_2d)
mean_silhouette_2d = mean(sils_2d)
@printf("Mean Silhouette Score (K-Means on 2D PCA data): %.3f\n", mean_silhouette_2d)
Comparing mean_silhouette_4d and mean_silhouette_2d can give insights. Sometimes, clustering on PCA-reduced data yields better (or more stable) silhouette scores, especially if the removed components were mostly noise. In other cases, information loss from PCA might degrade clustering quality.
For Iris, clustering on the PCA-reduced data often yields good silhouette scores because the first two principal components capture the species separation very well.
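The mean silhouette is also a common way to choose k itself when the number of clusters is not known in advance. Here is a sketch on the PCA-reduced data, reusing the distance matrix computed above (the range 2:5 is an arbitrary choice for illustration):
# Compare candidate values of k by their mean silhouette on the 2D PCA data
for k in 2:5
    mach_k = machine(KMeans(k=k), X_pca)
    fit!(mach_k, verbosity=0)
    labels_k = MLJ.levelcode.(predict(mach_k, X_pca))
    @printf("k = %d: mean silhouette = %.3f\n", k, mean(silhouettes(labels_k, dist_matrix_2d)))
end
Higher mean silhouette values suggest more compact, better-separated clusters, though the score tends to favor convex, evenly sized clusters.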
While this practical focused on K-Means, Julia's ecosystem, particularly through MLJ and Clustering.jl, supports other unsupervised algorithms. For instance, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is excellent for finding non-spherical clusters and identifying noise points. You could load and use it similarly:
# DBSCANModel = @load DBSCAN pkg=Clustering verbosity=0
# dbscan_model = DBSCANModel(eps=0.5, min_neighbors=5) # Parameters eps and min_neighbors need tuning
# mach_dbscan = machine(dbscan_model, X_pca) # Or X
# fit!(mach_dbscan)
# assignments_dbscan = predict(mach_dbscan, X_pca)
# ... then visualize and evaluate.
Exploring DBSCAN with different parameters (eps and min_neighbors; check the loaded model's docstring for the exact hyperparameter names in your interface version) on the Iris dataset, especially the PCA-reduced version, can be an insightful follow-up exercise.
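Another follow-up, since the Iris species labels do exist, is an external check of the clusterings against them. The sketch below assumes that randindex from Clustering.jl returns the adjusted Rand index as the first element of its result tuple (1.0 means perfect agreement with the species, values near 0 mean agreement no better than chance); consult the Clustering.jl documentation for your version:
# External validation: compare cluster assignments with the true species labels
species_int = MLJ.levelcode.(iris.Species)
ari_4d = randindex(assignments_int, species_int)[1]
ari_2d = randindex(assignments_pca_int, species_int)[1]
@printf("Adjusted Rand index vs species: 4D clustering %.3f, 2D PCA clustering %.3f\n", ari_4d, ari_2d)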
In this hands-on section, you've successfully:
- Loaded the Iris dataset and separated its features from the species labels.
- Applied K-Means clustering (k=3) to the original 4-dimensional data and visualized the result.
- Reduced the data to two dimensions with PCA and inspected the variance retained.
- Re-applied K-Means in the PCA-reduced space.
- Evaluated both clusterings with the mean Silhouette Score.
These steps represent a common workflow in exploratory data analysis and unsupervised machine learning. By working through these examples in Julia, you've gained practical experience with its powerful tools for these tasks, setting a foundation for tackling more complex problems.