While understanding the mathematical definition of the gradient is fundamental, calculating it analytically (by finding partial derivatives using differentiation rules) can become tedious or even intractable for the complex functions often encountered in machine learning, especially those defined implicitly by code. Fortunately, we can approximate gradients numerically. This practical section demonstrates how to compute gradients using Python and NumPy, leveraging the concept of finite differences.
Recall that a partial derivative like ∂x∂f measures the rate of change of the function f as x changes slightly, holding other variables constant. We can approximate this rate of change using a small step h:
The central difference formula generally provides a more accurate approximation for a given step size h. We'll use this method to compute each partial derivative needed for the gradient vector. The value h should be small (e.g., 10−5 or 10−7) to approximate the instantaneous rate of change, but not so small that we run into floating-point precision issues.
Let's consider a function that resembles a simple loss function in machine learning. Suppose we have a model with two parameters, w1 and w2, and we want to minimize the following objective function L(w1,w2):
L(w1,w2)=(2w1+3w2−5)2+(w1−w2−1)2
Our goal is to compute the gradient ∇L=[∂w1∂L,∂w2∂L] numerically at a specific point, say (w1,w2)=(1,1).
We can write a Python function that takes any multivariable function func
, a point point
(represented as a NumPy array), and a step size h
, and returns the numerically computed gradient.
import numpy as np
def loss_function(w):
"""
Example loss function L(w1, w2).
Args:
w: A NumPy array [w1, w2].
Returns:
The scalar value of the loss function.
"""
w1, w2 = w[0], w[1]
term1 = (2*w1 + 3*w2 - 5)**2
term2 = (w1 - w2 - 1)**2
return term1 + term2
def compute_gradient_numerical(func, point, h=1e-5):
"""
Computes the numerical gradient of a function at a given point.
Args:
func: The function to differentiate. Takes a NumPy array as input.
point: A NumPy array representing the point (e.g., [w1, w2]) at which
to compute the gradient.
h: The step size for finite differences.
Returns:
A NumPy array representing the gradient vector.
"""
point = np.asarray(point, dtype=float)
grad = np.zeros_like(point)
# Iterate over each dimension (parameter)
for i in range(len(point)):
# Create points shifted by +h and -h in the i-th dimension
point_plus_h = point.copy()
point_minus_h = point.copy()
point_plus_h[i] += h
point_minus_h[i] -= h
# Compute central difference
partial_derivative = (func(point_plus_h) - func(point_minus_h)) / (2 * h)
grad[i] = partial_derivative
return grad
# Define the point at which to compute the gradient
w_point = np.array([1.0, 1.0])
# Compute the numerical gradient
numerical_gradient = compute_gradient_numerical(loss_function, w_point)
print(f"Point (w1, w2): {w_point}")
print(f"Numerical Gradient at point: {numerical_gradient}")
# For comparison, let's calculate the analytical gradient at (1, 1)
# dL/dw1 = 10*w1 + 10*w2 - 22
# dL/dw2 = 10*w1 + 20*w2 - 28
analytical_gradient = np.array([
10*w_point[0] + 10*w_point[1] - 22,
10*w_point[0] + 20*w_point[1] - 28
])
print(f"Analytical Gradient at point: {analytical_gradient}")
# Calculate the difference (error)
error = np.linalg.norm(numerical_gradient - analytical_gradient)
print(f"Difference between numerical and analytical gradient: {error:.2e}")
Running this code will output:
Point (w1, w2): [1. 1.]
Numerical Gradient at point: [-2. 2.]
Analytical Gradient at point: [-2. 2.]
Difference between numerical and analytical gradient: 1.89e-11
As you can see, the numerical gradient calculated using the central difference method is extremely close to the analytical gradient we derived earlier (which was [−2,2] at (1,1)). The small difference is due to the approximation inherent in finite differences and floating-point arithmetic.
We can visualize the function's surface or its contours and plot the gradient vector at our point (1,1). The gradient vector [−2,2] points in the direction of the steepest ascent from (1,1). In optimization contexts like gradient descent (covered in the next chapter), we typically move in the opposite direction of the gradient to find a minimum.
{"layout": {"title": "Contour Plot of L(w1, w2) with Gradient at (1, 1)", "xaxis": {"title": "w1"}, "yaxis": {"title": "w2"}, "width": 600, "height": 500, "annotations": [{"x": 1, "y": 1, "ax": 1 + (-2*0.05), "ay": 1 + (2*0.05), "xref": "x", "yref": "y", "axref": "x", "ayref": "y", "showarrow": true, "arrowhead": 2, "arrowsize": 1, "arrowwidth": 2, "arrowcolor": "#f03e3e"}]}, "data": [{"type": "contour", "x": [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0], "y": [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0], "z": [[46.0, 30.25, 18.0, 9.25, 4.0, 2.25, 4.0, 9.25, 18.0], [30.25, 18.0, 9.25, 4.0, 2.25, 4.0, 9.25, 18.0, 30.25], [18.0, 9.25, 4.0, 2.25, 4.0, 9.25, 18.0, 30.25, 46.0], [9.25, 4.0, 2.25, 4.0, 9.25, 18.0, 30.25, 46.0, 65.25], [4.0, 2.25, 4.0, 9.25, 18.0, 30.25, 46.0, 65.25, 88.0], [2.25, 4.0, 9.25, 18.0, 30.25, 46.0, 65.25, 88.0, 114.25], [4.0, 9.25, 18.0, 30.25, 46.0, 65.25, 88.0, 114.25, 144.0], [9.25, 18.0, 30.25, 46.0, 65.25, 88.0, 114.25, 144.0, 177.25], [18.0, 30.25, 46.0, 65.25, 88.0, 114.25, 144.0, 177.25, 214.0]], "colorscale": "Viridis", "contours": {"coloring": "heatmap"}}, {"type": "scatter", "x": [1], "y": [1], "mode": "markers", "marker": {"color": "#f03e3e", "size": 10, "symbol": "x"}, "name": "Point (1, 1)"}]}
Contour plot of the loss function L(w1,w2). The red 'x' marks the point (1,1), and the red arrow indicates the numerically calculated gradient [−2,2]. Notice the arrow points perpendicular to the contour line towards higher function values (brighter colors).
Numerical gradient computation is extremely useful:
However, it has limitations:
h
and floating-point precision.In practice, for training large machine learning models, analytical gradient methods (like backpropagation, which utilizes the chain rule efficiently) are preferred for their speed and precision. Numerical methods often serve as a valuable tool for debugging and verification.
This hands-on exercise demonstrates how the abstract concept of the gradient translates into a concrete computational technique using NumPy, providing a practical tool for analyzing and optimizing multivariable functions common in machine learning.
© 2025 ApX Machine Learning