In the previous section, we explored vector norms, which provide a way to measure the magnitude or "length" of a single vector. This concept is foundational for understanding how "far apart" two vectors are in their vector space. Calculating the distance between vectors is a frequent operation in machine learning, often used to determine the similarity or dissimilarity between data points represented as vectors.
Think about a simple classification task: identifying whether a new email is spam or not based on its features (like word frequencies). We might represent the new email and existing emails as vectors in a high-dimensional space. To classify the new email, we could find the existing emails that are "closest" to it in this space. This idea is central to algorithms like k-Nearest Neighbors (k-NN). Similarly, clustering algorithms group data points based on their proximity to each other.
The distance between two vectors, u and v, in a vector space is typically defined as the norm of their difference vector, u−v. Just as there are different ways to measure the length of a single vector using various norms, there are corresponding ways to measure the distance between two vectors.
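In symbols, for whichever norm $\|\cdot\|$ is chosen, the distance is simply that norm applied to the difference:

$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$$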
The most common way to measure the distance between two vectors is the Euclidean distance, derived from the L2 norm. It represents the straight-line distance between the points defined by the vectors in the feature space. If $\mathbf{u} = (u_1, u_2, \dots, u_n)$ and $\mathbf{v} = (v_1, v_2, \dots, v_n)$, the Euclidean distance $d_2(\mathbf{u}, \mathbf{v})$ is:
$$d_2(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$
This is essentially the Pythagorean theorem generalized to multiple dimensions. It's the distance you would measure with a ruler if you could physically plot the vectors.
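In two dimensions, for example, it reduces to the familiar form

$$d_2(\mathbf{u}, \mathbf{v}) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2}$$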
Another useful distance metric is the Manhattan distance (also called city block distance or taxicab distance), derived from the L1 norm. It measures the distance by summing the absolute differences of the vector components. Imagine navigating a city grid where you can only travel along horizontal or vertical streets; the Manhattan distance is the total distance traveled.
$$d_1(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_1 = \sum_{i=1}^{n} |u_i - v_i|$$
The Manhattan distance is sometimes preferred over Euclidean distance in high-dimensional spaces or when dealing with features that have different units or scales, as it's less sensitive to large differences in a single dimension compared to the Euclidean distance (which squares the differences).
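To see this concretely, compare two difference vectors with the same Manhattan length (the numbers here are illustrative, not from a dataset):

$$\|(10, 1, 1)\|_1 = \|(4, 4, 4)\|_1 = 12, \qquad \|(10, 1, 1)\|_2 = \sqrt{102} \approx 10.10, \qquad \|(4, 4, 4)\|_2 = \sqrt{48} \approx 6.93$$

Because the Euclidean distance squares each component, the single large difference of 10 dominates it, while the Manhattan distance weighs every unit of difference equally.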
Consider two vectors in a 2D space, $\mathbf{u} = (2, 5)$ and $\mathbf{v} = (6, 2)$.

Figure: Euclidean distance (blue line) is the direct path. Manhattan distance (orange path) follows the grid lines, summing the horizontal and vertical legs.

The Euclidean distance is

$$d_2(\mathbf{u}, \mathbf{v}) = \sqrt{(6-2)^2 + (2-5)^2} = \sqrt{4^2 + (-3)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$$

and the Manhattan distance is

$$d_1(\mathbf{u}, \mathbf{v}) = |6-2| + |2-5| = |4| + |-3| = 4 + 3 = 7$$
NumPy makes calculating these distances straightforward. The `numpy.linalg.norm` function, which we used for vector norms, can compute the norm of the difference vector directly.
```python
import numpy as np

# Define two vectors (as NumPy arrays)
u = np.array([2, 5])
v = np.array([6, 2])

# Calculate the difference vector
difference = u - v
print(f"Difference vector (u - v): {difference}")  # Output: [-4  3]

# Calculate Euclidean (L2) distance
l2_distance = np.linalg.norm(difference)  # Default norm is L2
# Alternatively: np.linalg.norm(difference, ord=2)
print(f"Euclidean (L2) Distance: {l2_distance}")  # Output: 5.0

# Calculate Manhattan (L1) distance
l1_distance = np.linalg.norm(difference, ord=1)
print(f"Manhattan (L1) Distance: {l1_distance}")  # Output: 7.0

# Example with higher-dimensional vectors
feature_vec1 = np.array([0.1, 1.5, -2.3, 0.8])
feature_vec2 = np.array([0.3, 1.0, -2.0, 1.1])

diff_features = feature_vec1 - feature_vec2
l2_dist_features = np.linalg.norm(diff_features)
l1_dist_features = np.linalg.norm(diff_features, ord=1)

print(f"\nFeature Vector 1: {feature_vec1}")
print(f"Feature Vector 2: {feature_vec2}")
print(f"Euclidean Distance between features: {l2_dist_features:.4f}")
print(f"Manhattan Distance between features: {l1_dist_features:.4f}")
```
Understanding how to calculate distances between vectors is fundamental for many machine learning tasks, including k-Nearest Neighbors classification, clustering algorithms that group nearby points, and other similarity-based methods.
By representing data as vectors and using distance metrics, we can quantitatively analyze relationships between data points, enabling algorithms to learn patterns, make predictions, and group information effectively. The choice between Euclidean, Manhattan, or other distance metrics often depends on the specific problem, the nature of the data, and the dimensionality of the feature space.