Now that we've explored the theory behind Stochastic Variance Reduced Gradient (SVRG), a potent technique for combating the high variance inherent in standard SGD, let's translate that theory into practice. This section provides a hands-on guide to implementing SVRG, allowing you to observe its behavior and understand the practical considerations involved. We'll focus on a clear implementation that highlights the core mechanics of the algorithm.
Recall the core idea from our discussion on SVRG: periodically compute a full gradient over the entire dataset to serve as a reference point. Then, in subsequent stochastic steps, adjust the standard stochastic gradient by comparing the gradient of the same data point evaluated at the current parameters versus the reference parameters. This adjustment significantly reduces the variance of the gradient estimate, often leading to faster convergence compared to SGD, especially when measured in terms of passes over the data.
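Written out, with $\tilde{w}$ denoting the snapshot (reference) parameters and $\mu = \nabla f(\tilde{w})$ the full gradient computed at that snapshot, each inner step applies

$$w_{t+1} = w_t - \eta \left( \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \mu \right),$$

where $i_t$ is the index sampled at step $t$ and $\eta$ is the learning rate. In expectation over $i_t$, the correction $-\nabla f_{i_t}(\tilde{w}) + \mu$ vanishes, so the gradient estimate remains unbiased while its variance shrinks as both $w_t$ and $\tilde{w}$ approach the optimum.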
Let's outline the steps involved in a typical SVRG implementation for minimizing a function $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, where $f_i(w)$ is the loss for data point $i$:

1. Take a snapshot $\tilde{w}$ of the current parameters and compute the full gradient $\mu = \nabla f(\tilde{w})$ over all $n$ data points.
2. Run an inner loop of $m$ stochastic steps: at each step, sample an index $i_t$ uniformly at random and apply the variance-reduced update shown above.
3. After the $m$ inner steps, set the parameters from the final inner iterate, take a fresh snapshot, and repeat until convergence.
Let's implement SVRG to solve a simple least-squares linear regression problem. Our objective function for data $(X, y)$ with $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^{n}$ is:

$$f(w) = \frac{1}{2n}\lVert Xw - y \rVert_2^2 = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(x_i^T w - y_i\right)^2$$

Here, $f_i(w) = \frac{1}{2}(x_i^T w - y_i)^2$. The gradient for a single data point $(x_i, y_i)$ is:

$$\nabla f_i(w) = \left(x_i^T w - y_i\right) x_i$$

and the full gradient is:

$$\nabla f(w) = \frac{1}{n} X^T (Xw - y)$$

Below is a Python implementation using NumPy.
import numpy as np
def linear_loss(w, X, y):
    """Calculates the mean squared error loss."""
    n_samples = X.shape[0]
    predictions = X.dot(w)
    loss = (1 / (2 * n_samples)) * np.sum((predictions - y)**2)
    return loss

def full_gradient(w, X, y):
    """Calculates the full gradient for linear regression."""
    n_samples = X.shape[0]
    predictions = X.dot(w)
    gradient = (1 / n_samples) * X.T.dot(predictions - y)
    return gradient
def stochastic_gradient(w, X, y, index):
    """Calculates the stochastic gradient for a single data point."""
    x_i = X[index]          # feature vector of sample `index`, shape (d,)
    y_i = y[index]
    prediction_i = x_i.dot(w)
    # Gradient of f_i(w) = 0.5 * (x_i^T w - y_i)^2 is the scalar residual
    # (prediction_i - y_i) times the feature vector x_i, so the result
    # already has the same shape as w.
    gradient_i = (prediction_i - y_i) * x_i
    return gradient_i
def svrg_optimizer(X, y, n_outer_epochs, m, learning_rate):
    """Implements the SVRG algorithm for linear regression."""
    n_samples, n_features = X.shape

    # Ensure y is a 1D array consistent with the predictions X.dot(w)
    if y.ndim == 2 and y.shape[1] == 1:
        y = y.flatten()
    elif y.ndim != 1:
        raise ValueError("y must be a 1D array or a column vector")

    # Initialize parameters (e.g., zeros)
    w = np.zeros(n_features)
    loss_history = []

    print(f"Starting SVRG: {n_outer_epochs} outer epochs, m={m}, lr={learning_rate}")
    for s in range(n_outer_epochs):
        # 1. Snapshot and full gradient calculation
        w_snapshot = np.copy(w)
        mu = full_gradient(w_snapshot, X, y)

        # Record the loss at the snapshot (before the inner loop)
        current_loss = linear_loss(w_snapshot, X, y)
        loss_history.append(current_loss)
        if s % 10 == 0 or s == n_outer_epochs - 1:
            print(f"Outer Epoch {s+1}/{n_outer_epochs}, Loss: {current_loss:.6f}")

        # 2. Inner loop of m stochastic variance-reduced steps
        w_inner = np.copy(w_snapshot)  # Start the inner loop from the snapshot
        for t in range(m):
            # Select a random sample
            i_t = np.random.randint(0, n_samples)
            # Stochastic gradients at the current iterate and at the snapshot
            grad_current = stochastic_gradient(w_inner, X, y, i_t)
            grad_snapshot = stochastic_gradient(w_snapshot, X, y, i_t)
            # SVRG update with the variance-reduced gradient estimate
            w_inner = w_inner - learning_rate * (grad_current - grad_snapshot + mu)

        # 3. Update the outer-loop parameters with the result of the inner loop
        w = np.copy(w_inner)

    print("SVRG Finished.")
    return w, loss_history
# --- Example Usage ---
# Generate synthetic data
np.random.seed(42)
n_samples, n_features = 1000, 10
X = np.random.randn(n_samples, n_features)
true_w = np.random.randn(n_features) * 5
y = X.dot(true_w) + np.random.randn(n_samples) * 0.5 # Add some noise
# SVRG parameters
n_outer_epochs = 50 # Number of times we compute the full gradient
m = n_samples # Number of inner loop steps (1 effective pass per outer epoch)
learning_rate = 0.1
# Run SVRG
w_svrg, svrg_loss_history = svrg_optimizer(X, y, n_outer_epochs, m, learning_rate)
print("\nLearned weights (SVRG):", w_svrg)
To appreciate the behavior of SVRG, it's instructive to compare its convergence profile against standard SGD. You would typically implement a similar loop for SGD (updating with only $\nabla f_{i_t}(w_t)$) and run it for an equivalent number of gradient computations or effective data passes.
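As a reference, here is one minimal way such an SGD baseline could look. It is a sketch that reuses the linear_loss and stochastic_gradient helpers defined above; the function name sgd_optimizer and the smaller learning rate are illustrative choices, since plain SGD typically needs a smaller (or decaying) step size than SVRG to remain stable.

def sgd_optimizer(X, y, n_epochs, learning_rate):
    """Plain SGD baseline, recording the loss once per effective pass over the data."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    loss_history = [linear_loss(w, X, y)]
    for epoch in range(n_epochs):
        for _ in range(n_samples):
            i_t = np.random.randint(0, n_samples)
            # Standard SGD uses only the stochastic gradient at the current iterate
            w = w - learning_rate * stochastic_gradient(w, X, y, i_t)
        loss_history.append(linear_loss(w, X, y))
    return w, loss_history

# Match SVRG's budget: each SVRG outer epoch costs roughly two effective passes
w_sgd, sgd_loss_history = sgd_optimizer(X, y, n_epochs=2 * n_outer_epochs, learning_rate=0.01)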
Let's visualize the potential difference in convergence speed. We'll plot the loss against the number of effective passes through the data. For SVRG, each outer epoch involves one full gradient calculation (equivalent to one pass) plus m stochastic gradient calculations. If m=n, each outer epoch corresponds to roughly two effective passes over the data (one for the full gradient, one effective pass for the n stochastic updates). For standard SGD, n stochastic updates constitute one effective pass.
{"data":[{"type":"scatter","mode":"lines","name":"SVRG","x":[i*2 for i in range(len(svrg_loss_history))],"y":svrg_loss_history,"line":{"color":"#1c7ed6"}}, {"type": "scatter", "mode": "lines", "name": "SGD (Illustrative)", "x": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], "y": [15, 8, 5, 3.5, 2.8, 2.4, 2.1, 1.9, 1.8, 1.75, 1.7], "line": {"color": "#f03e3e", "dash": "dot"}}],"layout":{"title":"SVRG vs. SGD Convergence (Illustrative)","xaxis":{"title":"Effective Data Passes"},"yaxis":{"title":"Loss (MSE)","type":"log"},"legend":{"x":0.7,"y":0.98}}}
Illustrative comparison of loss reduction per effective data pass for SVRG and a typical SGD run. SVRG often exhibits faster convergence, especially on a log scale for the loss. Note that the SGD curve is manually created for illustration. Run your own SGD implementation for a direct comparison on the same data.
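If you run both optimizers on the same data, a minimal plotting sketch (assuming matplotlib is available and that sgd_loss_history comes from the SGD baseline above) could look like this:

import matplotlib.pyplot as plt

# SVRG records one loss per outer epoch; with m = n each outer epoch costs
# roughly two effective passes (one full gradient + n stochastic gradients).
svrg_passes = [2 * s for s in range(len(svrg_loss_history))]
# The SGD history records one loss per effective pass (plus the initial point).
sgd_passes = list(range(len(sgd_loss_history)))

plt.plot(svrg_passes, svrg_loss_history, label="SVRG")
plt.plot(sgd_passes, sgd_loss_history, label="SGD", linestyle="--")
plt.yscale("log")
plt.xlabel("Effective Data Passes")
plt.ylabel("Loss (MSE)")
plt.title("SVRG vs. SGD Convergence")
plt.legend()
plt.show()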
One hyperparameter deserves particular attention: the number of inner loop iterations, m. It controls how many cheap stochastic steps are taken for each expensive full-gradient snapshot. If m is too small, the full-gradient computation dominates the cost per update; if m is too large, the snapshot gradient becomes stale and the variance reduction weakens. Choosing m on the order of the dataset size n (the example above uses m = n) is a common starting point.
This practical implementation should provide a solid foundation for applying SVRG to larger, more complex problems. Remember that careful tuning of m and η is often required to achieve the best performance. Experimenting with SVRG on your own datasets and comparing it to SGD is the best way to gain intuition about its strengths and practical behavior.