Now that we have explored the mechanics and variations of gradient descent, let's solidify our understanding by implementing the basic algorithm from scratch using Python. This hands-on exercise will help you see the iterative optimization process in action.
We'll apply gradient descent to find the minimum of a simple single-variable function. While real-world machine learning problems involve functions with thousands or millions of variables (parameters), starting with one variable makes it easier to grasp the core update mechanism and visualize the process.
Let's consider the function f(x) = x^2 − 4x + 5. This is a parabola opening upwards, so it has a single global minimum.
Our goal is to find the value of x that minimizes f(x) using gradient descent.
First, we need the function itself and its derivative (gradient). The function is: f(x) = x^2 − 4x + 5
The derivative (the gradient in this 1D case) tells us the slope at any point x: ∇f(x) = f'(x) = d/dx (x^2 − 4x + 5) = 2x − 4
We can define these in Python:
```python
import numpy as np

# Define the function
def f(x):
    return x**2 - 4*x + 5

# Define the gradient (derivative) of the function
def gradient(x):
    return 2*x - 4
```
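As a quick sanity check, the gradient should vanish exactly at the analytic minimum: setting 2x − 4 = 0 gives x = 2, where f(2) = 1. A short self-contained snippet (repeating the two definitions) confirms this:

```python
# Same definitions as above, repeated so this snippet runs on its own
def f(x):
    return x**2 - 4*x + 5

def gradient(x):
    return 2*x - 4

# At the analytic minimum x = 2, the slope is zero and f takes its lowest value
print(gradient(2.0))  # 0.0
print(f(2.0))         # 1.0
```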
The core of gradient descent is the update rule, applied iteratively: x_new = x_old − α ∇f(x_old), where α is the learning rate.
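To see the rule in action once before automating it, here is a single hand-traced update starting from x = 0 with α = 0.1 (the same values used in the full run below):

```python
# One update step of x_new = x_old - alpha * gradient(x_old)
x_old = 0.0
alpha = 0.1
grad = 2 * x_old - 4          # gradient at x = 0 is -4
x_new = x_old - alpha * grad  # 0.0 - 0.1 * (-4) = 0.4
print(x_new)                  # 0.4
```

The negative gradient at x = 0 pushes x to the right, towards the minimum at x = 2.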
Let's implement this as a function that takes a starting point, a learning rate, and a number of iterations:

```python
def gradient_descent(start_x, learning_rate, n_iterations):
    """
    Performs gradient descent to find the minimum of f(x).

    Args:
        start_x: The initial guess for x.
        learning_rate: The step size alpha.
        n_iterations: The number of iterations to perform.

    Returns:
        A tuple containing:
        - The final estimated value of x.
        - A list of x values at each iteration (history).
    """
    x = start_x
    history = [x]  # Store history for visualization

    print(f"Starting Gradient Descent at x = {x:.4f}")
    print("Iter | Current x | Gradient | Next x")
    print("--------------------------------------")

    for i in range(n_iterations):
        grad = gradient(x)
        next_x = x - learning_rate * grad
        print(f"{i+1:4} | {x:9.4f} | {grad:8.4f} | {next_x:9.4f}")
        x = next_x
        history.append(x)

    print("--------------------------------------")
    print(f"Gradient Descent finished after {n_iterations} iterations.")
    print(f"Estimated minimum occurs at x = {x:.4f}")
    print(f"Value of f(x) at minimum: {f(x):.4f}")
    return x, history
```
```python
# --- Set Parameters and Run ---
initial_x = 0.0   # Starting point
alpha = 0.1       # Learning rate
iterations = 25   # Number of steps

final_x, x_history = gradient_descent(initial_x, alpha, iterations)
```
Running this code will print the progress of the algorithm at each step, showing how the value of x approaches the minimum. You should observe that the gradient gets closer to zero as x approaches the minimum (which we know analytically is at x = 2, since 2x − 4 = 0 implies x = 2).
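For this particular quadratic we can even predict the final value in closed form, which makes a nice check on the run. Substituting the gradient into the update rule gives x_new − 2 = (1 − 2α)(x_old − 2), so the distance to the minimum shrinks by a constant factor of 0.8 per step when α = 0.1:

```python
# Starting at x = 0, the distance to the minimum (2 units) decays
# geometrically: x_t = 2 - 2 * (0.8 ** t)
expected = 2 - 2 * 0.8**25
print(f"{expected:.4f}")  # 1.9924 -- should match the final x printed by the loop
```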
Visualizing the process can provide further insight. We can plot the function f(x) and overlay the steps taken by the gradient descent algorithm.
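One possible plotting sketch, assuming matplotlib is installed, is shown below. The short `x_history` list here is only a placeholder; in practice you would pass in the full history returned by `gradient_descent`:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x**2 - 4*x + 5

# Placeholder path: replace with the x_history returned by gradient_descent
x_history = [0.0, 0.4, 0.72, 0.976]

xs = np.linspace(-1, 5, 200)
plt.plot(xs, f(xs), label="f(x) = x^2 - 4x + 5")
plt.scatter(x_history, [f(x) for x in x_history],
            color="red", zorder=3, label="gradient descent steps")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
```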
Such a plot shows the quadratic function f(x) and the path taken by gradient descent starting from x = 0 with a learning rate of 0.1. Each marker represents the value of x at an iteration, moving progressively towards the minimum at x = 2. To draw the path, use the `x_history` values returned by the `gradient_descent` function and compute the corresponding y values with `f(x)`.
Try changing the `initial_x`, `learning_rate` (alpha), and `iterations` parameters, and observe how these changes affect the convergence:

- A different `initial_x`: the algorithm should still converge to the same minimum for this convex function.
- A larger `learning_rate` (e.g., 0.8): might converge faster, but if it's too large (e.g., 1.1 for this function), the algorithm might overshoot the minimum and diverge (the x values will get larger and larger).
- A smaller `learning_rate` (e.g., 0.01): convergence will be slower, requiring more iterations to get close to the minimum.
- Fewer `iterations`: the algorithm might stop before reaching the minimum.

This simple implementation demonstrates the core concept. In machine learning, f(x) represents the cost function J(θ), x represents the entire set of model parameters θ, and ∇f(x) represents the gradient vector ∇J(θ). The gradient calculation often involves techniques like backpropagation (covered in the next chapter), but the fundamental update step remains the same: adjust parameters in the direction opposite to the gradient to minimize the cost.
© 2025 ApX Machine Learning