High-dimensional datasets, common in data science, present a significant challenge for visualization. Humans perceive the world in three dimensions, making it difficult to directly plot and interpret data with tens, hundreds, or even thousands of features. Yet, visualizing data is a powerful tool for exploratory analysis, helping us identify patterns, spot clusters, detect outliers, and generally understand the underlying structure within unlabeled data. Dimensionality reduction techniques specifically tailored for visualization aim to map high-dimensional data into a lower-dimensional space (typically 2D or 3D) while preserving meaningful relationships between data points.
This section explores two widely used techniques for reducing dimensionality for visualization purposes: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). While PCA was introduced earlier as a general dimensionality reduction method, we revisit it here specifically through the lens of visualization. t-SNE is a technique primarily designed for visualizing high-dimensional data.
PCA achieves dimensionality reduction by finding a new set of orthogonal axes, called principal components, that capture the maximum variance in the data. The first principal component accounts for the largest variance, the second principal component (orthogonal to the first) accounts for the next largest variance, and so on.
For visualization, we typically project the data onto the first two principal components (PC1 and PC2). This 2D representation captures the directions of greatest spread in the original data. While information is lost by discarding the remaining components, this projection often provides a useful overview of the data's structure, potentially revealing clusters or trends.
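To make the mechanics concrete, here is a minimal NumPy sketch that recovers the principal components of a small, made-up dataset from the eigendecomposition of its covariance matrix. The data values are arbitrary and purely illustrative; scikit-learn's PCA uses an SVD-based implementation but yields equivalent components (up to sign).
import numpy as np
# A tiny 2-feature dataset (illustrative values only)
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7]])
# Center the data (PCA assumes zero-mean features)
X_centered = X - X.mean(axis=0)
# Covariance matrix of the centered data
cov = np.cov(X_centered, rowvar=False)
# Eigenvectors of the covariance matrix are the principal components;
# eigenvalues are the variance captured along each component
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Sort components by decreasing variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# Project the centered data onto the principal components
X_projected = X_centered @ eigenvectors
print("Variance captured per component:", eigenvalues / eigenvalues.sum())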
Remember that PCA is sensitive to the scale of the features. It's standard practice to scale the data (e.g., using StandardScaler from scikit-learn) before applying PCA.
Let's illustrate projecting data onto the first two principal components using scikit-learn:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px
import plotly.io as pio
# Load sample data (e.g., Iris dataset)
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# 1. Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 2. Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Create a DataFrame for plotting
pca_df = pd.DataFrame(data = X_pca, columns = ['Principal Component 1', 'Principal Component 2'])
pca_df['target'] = y
pca_df['species'] = pca_df['target'].apply(lambda i: target_names[i])
# 3. Visualize the results
fig = px.scatter(pca_df, x='Principal Component 1', y='Principal Component 2',
                 color='species', title='PCA of Iris Dataset (2 Components)',
                 labels={'species': 'Species'},
                 color_discrete_map={  # Use course palette colors
                     'setosa': '#228be6',      # blue
                     'versicolor': '#51cf66',  # green
                     'virginica': '#be4bdb'    # grape
                 })
fig.update_layout(
    xaxis_title="Principal Component 1",
    yaxis_title="Principal Component 2",
    legend_title="Species",
    width=700,   # Adjust width for web display
    height=500   # Adjust height for web display
)
# Optional: Show explained variance ratio
print(f"Explained variance ratio by component: {pca.explained_variance_ratio_}")
print(f"Total explained variance by 2 components: {np.sum(pca.explained_variance_ratio_):.4f}")
# Display the plot (or generate JSON for web embedding)
# fig.show()
# To generate JSON for embedding:
# print(pio.to_json(fig))
PCA projection of the Iris dataset onto the first two principal components. Colors indicate the true species labels, showing how PCA can separate the groups based on variance. The total explained variance (around 95.8% in this case) indicates how much of the original data's variability is captured by these two components.
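If you want to see how that figure breaks down component by component, you can fit PCA without limiting the number of components and inspect the cumulative explained variance. A minimal sketch, reusing X_scaled and the imports from the example above:
# Fit PCA with all components to inspect the full variance breakdown
pca_full = PCA()  # keeps min(n_samples, n_features) components by default
pca_full.fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, (ratio, cum) in enumerate(zip(pca_full.explained_variance_ratio_, cumulative), start=1):
    print(f"PC{i}: {ratio:.4f} (cumulative: {cum:.4f})")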
PCA provides a linear projection, which is computationally efficient and interpretable in terms of variance. However, it may not effectively separate clusters that are defined by non-linear relationships or local density.
t-SNE is a non-linear dimensionality reduction technique primarily used for visualization. Unlike PCA, which focuses on maximizing variance (preserving global structure), t-SNE aims to preserve the local structure of the data. It models the similarity between high-dimensional data points as conditional probabilities and then tries to find a low-dimensional embedding (typically 2D or 3D) where the similarities between the low-dimensional points closely match the high-dimensional similarities.
t-SNE is particularly effective at revealing clusters in the data. Points that are close together in the high-dimensional space tend to be mapped close together in the low-dimensional space.
Key aspects of t-SNE:
- perplexity: Roughly related to the number of nearest neighbors considered for each point. Typical values are between 5 and 50. It influences the balance between local and global aspects of the data.
- n_iter: The number of optimization iterations. Usually needs several hundred iterations (e.g., 1000) to converge.
- learning_rate: Controls the step size during optimization.
It's important to note that the resulting t-SNE plot is primarily for visual exploration. The distances between apparent clusters in the t-SNE plot may not be meaningful, and the global arrangement can vary between runs or with different perplexity values. Focus on the groupings of points, not their relative positions or sizes.
Here's how to apply t-SNE using scikit-learn, again using the scaled Iris data:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE # Note: Different module than PCA
import plotly.express as px
import plotly.io as pio
# Load and scale data (as before)
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply t-SNE
# Common parameters: perplexity=30, n_iter=1000
# Note: recent scikit-learn releases rename n_iter to max_iter
tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
# Create a DataFrame for plotting
tsne_df = pd.DataFrame(data = X_tsne, columns = ['TSNE Component 1', 'TSNE Component 2'])
tsne_df['target'] = y
tsne_df['species'] = tsne_df['target'].apply(lambda i: target_names[i])
# Visualize the results
fig_tsne = px.scatter(tsne_df, x='TSNE Component 1', y='TSNE Component 2',
                      color='species', title='t-SNE Visualization of Iris Dataset',
                      labels={'species': 'Species'},
                      color_discrete_map={  # Use course palette colors
                          'setosa': '#228be6',      # blue
                          'versicolor': '#51cf66',  # green
                          'virginica': '#be4bdb'    # grape
                      })
fig_tsne.update_layout(
    xaxis_title="t-SNE Component 1",
    yaxis_title="t-SNE Component 2",
    legend_title="Species",
    width=700,   # Adjust width for web display
    height=500   # Adjust height for web display
)
# Display the plot (or generate JSON for web embedding)
# fig_tsne.show()
# To generate JSON for embedding:
# print(pio.to_json(fig_tsne))
t-SNE projection of the Iris dataset. Notice how t-SNE often produces more distinct and well-separated clusters compared to PCA, effectively capturing the local similarities between data points of the same species.
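Because the layout of a t-SNE embedding depends on the perplexity value and the random seed, it is worth checking whether the cluster structure is stable across a few settings before drawing conclusions. A minimal sketch, reusing X_scaled from above (the perplexity values chosen here are arbitrary):
# Compare embeddings across several perplexity values (illustrative choices)
for perplexity in [5, 30, 50]:
    tsne_p = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embedding = tsne_p.fit_transform(X_scaled)
    print(f"perplexity={perplexity}: embedding shape {embedding.shape}")
    # In practice, plot each embedding (e.g., with px.scatter) and check
    # whether the same groups of points stay together across settings.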
In practice, PCA is often applied first to reduce the dimensionality substantially (e.g., to 50 components) before running t-SNE; this can speed up t-SNE and suppress noise in the original features, as sketched below.
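Here is a minimal sketch of that two-stage approach. It uses scikit-learn's digits dataset (64 features) as a stand-in for genuinely high-dimensional data, and the choice of 50 PCA components is only illustrative:
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Load a higher-dimensional dataset (1797 samples, 64 features)
digits = load_digits()
X_digits = StandardScaler().fit_transform(digits.data)
# Stage 1: PCA removes low-variance directions and compresses the data
X_reduced = PCA(n_components=50, random_state=42).fit_transform(X_digits)
# Stage 2: t-SNE embeds the PCA-reduced data into 2D for plotting
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_reduced)
print(X_digits.shape, "->", X_reduced.shape, "->", X_embedded.shape)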
By applying techniques like PCA and t-SNE, you can transform complex, high-dimensional data into interpretable 2D or 3D plots, facilitating the discovery of patterns and structures that would otherwise remain hidden. This visual exploration is an invaluable part of the unsupervised learning toolkit for understanding unlabeled data.