Training quantum machine learning models, particularly the variational algorithms we'll encounter later, almost always involves a classical optimization loop. While part of the computation happens on a quantum processor (evaluating a cost function or its gradients), a classical computer runs an optimization algorithm to update the parameters of the quantum circuit. A solid understanding of classical optimization techniques is therefore essential for building and training effective QML models. This section revisits the key gradient-based and gradient-free optimization methods most often employed in this hybrid quantum-classical setting.
Gradient-Based Optimization
Most workhorse optimizers in machine learning rely on the gradient of the cost function $L(\theta)$ with respect to the parameters $\theta$, which points in the direction of steepest ascent (its negative gives the direction of steepest descent). The general update rule is $\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$, where $\eta$ is the learning rate.
Stochastic Gradient Descent (SGD) and Mini-batches
Calculating the exact gradient often requires summing contributions over an entire dataset, which can be computationally prohibitive. Stochastic Gradient Descent (SGD) addresses this by estimating the gradient using only a small, randomly selected subset of the data, called a mini-batch, at each step.
$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L\big(\theta_t;\, x^{(i:i+b)}, y^{(i:i+b)}\big)$
Here, $L\big(\theta_t;\, x^{(i:i+b)}, y^{(i:i+b)}\big)$ is the loss computed on a mini-batch of size $b$ starting at index $i$. The stochastic nature of the gradient estimate introduces noise, which can sometimes help the optimizer escape shallow local minima but also causes fluctuations around the optimum.
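To make the update concrete, here is a minimal NumPy sketch of a mini-batch SGD loop. The names and defaults (`sgd_minibatch`, `grad_fn`, the batch size and learning rate) are illustrative assumptions; `grad_fn` stands in for whatever routine returns the loss gradient on a batch, e.g. a parameter-shift evaluation of a quantum model.

```python
import numpy as np

def sgd_minibatch(grad_fn, theta, X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Plain mini-batch SGD; grad_fn(theta, X_batch, y_batch) returns the
    gradient of the loss evaluated on that batch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)                  # reshuffle data each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # indices of one mini-batch
            g = grad_fn(theta, X[idx], y[idx])     # stochastic gradient estimate
            theta = theta - lr * g                 # descent step
    return theta
```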
Momentum Methods
Simple SGD can be slow, especially in areas where the cost surface curves more steeply in one dimension than another (forming long, narrow valleys). Momentum methods aim to accelerate convergence and dampen oscillations by adding a fraction γ of the previous update vector to the current one, simulating physical momentum.
The update rules are:
$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta L(\theta_t)$
$\theta_{t+1} = \theta_t - v_t$
Typically, γ is around 0.9. This helps the optimizer continue moving in the prevailing direction and smooths out variations caused by the stochastic gradient estimates. Nesterov Accelerated Gradient (NAG) is a refinement where the gradient is computed at a point projected forward using the current momentum, offering potentially better convergence.
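A single momentum step can be written in a few lines. The sketch below assumes the gradient has already been computed at the current parameters; the Nesterov variant is indicated only in a comment.

```python
import numpy as np

def momentum_update(theta, v, grad, lr=0.01, gamma=0.9):
    """One heavy-ball momentum step; v is the velocity, initialized to zeros."""
    v = gamma * v + lr * grad   # decaying accumulation of past update directions
    theta = theta - v           # move along the accumulated velocity
    return theta, v

# Nesterov variant (sketch): compute grad at the look-ahead point
# theta - gamma * v instead of at theta, then apply the same update.
```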
Adaptive Learning Rate Methods
Tuning the learning rate η can be challenging. Adaptive methods adjust the learning rate automatically during training, often on a per-parameter basis.
AdaGrad (Adaptive Gradient): Accumulates the sum of squared past gradients for each parameter and scales the learning rate inversely proportional to this sum. It performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones. However, the accumulated sum grows continuously, causing the learning rate to eventually become infinitesimally small, potentially stopping learning prematurely.
RMSProp (Root Mean Square Propagation): Modifies AdaGrad by using an exponentially decaying average of squared gradients instead of the full sum. This prevents the learning rate from monotonically decreasing.
Adam (Adaptive Moment Estimation): Perhaps the most popular adaptive optimizer currently, Adam combines the ideas of momentum and RMSProp. It maintains exponentially decaying averages of both past gradients ($m_t$, the first moment estimate) and past squared gradients ($v_t$, the second raw moment estimate).
The update steps for Adam are:
Compute the gradient: $g_t = \nabla_\theta L(\theta_t)$
Update the biased first moment estimate: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
Update the biased second raw moment estimate: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
Compute the bias-corrected first moment estimate: $\hat{m}_t = m_t / (1 - \beta_1^t)$
Compute the bias-corrected second raw moment estimate: $\hat{v}_t = v_t / (1 - \beta_2^t)$
Update the parameters: $\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
Common values are $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. Adam often works well with default settings and is frequently used for training both classical deep neural networks and variational quantum circuits.
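The full update fits comfortably in a short function. This NumPy sketch follows the steps above, carrying the state variables `m`, `v`, and the iteration counter `t` between calls; the names and defaults are illustrative.

```python
import numpy as np

def adam_update(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; m and v are the running moment estimates (start at zeros)
    and t is the 1-based iteration counter used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad           # biased first moment
    v = beta2 * v + (1.0 - beta2) * grad**2        # biased second raw moment
    m_hat = m / (1.0 - beta1**t)                   # bias-corrected estimates
    v_hat = v / (1.0 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```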
Figure: Conceptual illustration of optimization paths for SGD, Momentum, and Adam on a sample cost surface. Adaptive methods like Adam often converge more directly towards the minimum.
Second-Order Methods
While gradient descent uses the first derivative (gradient), second-order methods also incorporate the second derivative (Hessian matrix, H), which contains information about the curvature of the cost surface.
Newton's Method: The update rule is $\theta_{t+1} = \theta_t - \eta H^{-1} \nabla_\theta L(\theta_t)$. By using curvature information, Newton's method can converge much faster than gradient descent near an optimum. However, computing, storing, and inverting the Hessian matrix is computationally expensive ($O(d^3)$ for inversion, where $d$ is the number of parameters), making it impractical for models with many parameters.
Quasi-Newton Methods (e.g., BFGS, L-BFGS): These methods approximate the inverse Hessian matrix using information from gradient changes over successive steps. L-BFGS (Limited-memory BFGS) is particularly popular as it only stores a limited history of gradients and updates, making it more memory-efficient than standard BFGS while often retaining good convergence properties. These methods are sometimes viable for optimizing variational quantum algorithms (VQAs) when gradients are accessible and the number of parameters is moderate.
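In practice one rarely implements L-BFGS by hand; SciPy's `scipy.optimize.minimize` exposes it directly. The sketch below minimizes a deliberately ill-conditioned quadratic as a stand-in for a smooth, noise-free cost; the `cost` and `grad` functions are illustrative placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta):
    # Narrow-valley quadratic, a toy stand-in for a smooth model cost
    return theta[0]**2 + 100.0 * theta[1]**2

def grad(theta):
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

theta0 = np.array([1.0, 1.0])
result = minimize(cost, theta0, jac=grad, method="L-BFGS-B")
print(result.x, result.nit)   # minimizer estimate and number of iterations
```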
Gradient-Free Optimization
In some QML scenarios, calculating gradients can be challenging or impossible. This might be due to the nature of the cost function, the presence of significant noise on quantum hardware making gradient estimation unreliable, or simply because the computational cost of gradient calculation (e.g., via the parameter-shift rule) is too high. In such cases, gradient-free (or derivative-free) optimization methods become necessary.
Simultaneous Perturbation Stochastic Approximation (SPSA): SPSA is particularly well-suited for optimization problems with noisy function evaluations, a common situation when running algorithms on near-term quantum computers. Instead of calculating the full gradient (requiring $2d$ measurements for finite differences or the parameter-shift rule), SPSA estimates the gradient using only two measurements, regardless of the dimension $d$. It does this by simultaneously perturbing all parameters using a random direction vector $\Delta_k$ (where each element is typically drawn from a Bernoulli distribution, e.g., $\pm 1$). The gradient estimate is formed using:
$\hat{g}_k(\theta_k) = \frac{L(\theta_k + c_k \Delta_k) - L(\theta_k - c_k \Delta_k)}{2 c_k} \, \Delta_k^{-1}$
where $c_k$ is a gain sequence that decreases over iterations and $\Delta_k^{-1}$ denotes the element-wise inverse of the perturbation vector. Despite being a noisier estimate, its low measurement cost makes SPSA attractive for optimizing VQAs on hardware.
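A bare-bones SPSA loop is only a few lines. The sketch below assumes `cost_fn` returns a (possibly noisy) scalar cost, such as an expectation value estimated from shots; the gain constants and decay exponents are common textbook choices rather than tuned values.

```python
import numpy as np

def spsa_minimize(cost_fn, theta, iters=200, a=0.1, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Minimal SPSA: two cost evaluations per iteration, independent of dimension."""
    rng = np.random.default_rng(seed)
    for k in range(1, iters + 1):
        a_k = a / k**alpha                                  # step-size gain
        c_k = c / k**gamma                                  # perturbation gain
        delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Bernoulli +/-1 directions
        loss_plus = cost_fn(theta + c_k * delta)
        loss_minus = cost_fn(theta - c_k * delta)
        # Dividing by delta applies the element-wise inverse (entries are +/-1)
        g_hat = (loss_plus - loss_minus) / (2.0 * c_k * delta)
        theta = theta - a_k * g_hat
    return theta
```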
Nelder-Mead: This method uses a simplex (a geometric shape with d+1 vertices in d dimensions) and iteratively modifies it through reflections, expansions, and contractions based on the function values at the vertices, trying to move the simplex towards the minimum. It's a heuristic method with no strong convergence guarantees but often works well in practice for low-dimensional problems.
COBYLA (Constrained Optimization BY Linear Approximation): Another derivative-free method that approximates the objective function and constraints using linear interpolations within a trust region.
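Both methods are available through SciPy's `scipy.optimize.minimize` and require only cost evaluations, no gradients. The cost below is a hypothetical noisy stand-in for a shot-based circuit evaluation, and the option values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def noisy_cost(theta):
    # Smooth landscape plus small additive noise, mimicking shot noise
    return np.sum(np.cos(theta))**2 + 0.01 * rng.normal()

theta0 = np.zeros(4)
res_nm = minimize(noisy_cost, theta0, method="Nelder-Mead",
                  options={"maxiter": 500})
res_cobyla = minimize(noisy_cost, theta0, method="COBYLA",
                      options={"maxiter": 500, "rhobeg": 0.5})
print(res_nm.fun, res_cobyla.fun)   # best cost values found by each method
```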
Optimization Challenges in the QML Context
Optimizing QML models presents unique challenges compared to classical ML:
Hardware Noise: Evaluating the cost function L(θ) on quantum hardware inevitably involves noise (decoherence, gate errors, readout errors). This noise can corrupt the cost value and any gradients estimated from it, potentially stalling or misleading the optimizer. Robustness to noise is a significant consideration, favouring methods like SGD or SPSA which inherently handle stochasticity, or requiring sophisticated error mitigation techniques (discussed in Chapter 7).
Barren Plateaus: For many common choices of Parameterized Quantum Circuits (PQCs), particularly deep or global ones, the gradient of the cost function vanishes exponentially with the number of qubits. This "barren plateau" phenomenon makes gradient-based optimization extremely difficult, as the landscape becomes flat almost everywhere. Strategies to mitigate this include careful ansatz design, specific initialization, or using gradient-free methods (though these are not immune). This is covered in detail in Chapter 4.
Complex Cost Landscapes: The cost functions derived from quantum circuits can be highly non-convex, featuring numerous local minima. Finding the global minimum, or even a good local minimum, requires effective optimization strategies and often careful hyperparameter tuning (learning rates, momentum parameters, SPSA gain sequences).
Measurement Cost: Obtaining accurate expectation values (required for the cost function and gradients) from a quantum computer requires repeated measurements (shots). This statistical estimation cost adds another layer to the optimization process, often involving a trade-off between measurement accuracy and optimization speed.
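As a rough back-of-the-envelope, the standard error of an expectation value estimated from $N$ shots scales as $\sigma/\sqrt{N}$, so halving the error quadruples the measurement cost. The helper below is a hypothetical illustration of that scaling for a bounded observable.

```python
import numpy as np

def shots_for_precision(eps, sigma=1.0):
    """Shots needed so the standard error sigma/sqrt(N) falls below eps.
    For a Pauli observable (eigenvalues +/-1) the single-shot standard
    deviation is at most 1, so sigma=1.0 is a conservative default."""
    return int(np.ceil((sigma / eps)**2))

print(shots_for_precision(0.01))   # about 10,000 shots for 0.01 precision
```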
While classical optimization algorithms provide the engine for training, their successful application in QML requires careful consideration of these quantum-specific factors. The choice between gradient-based methods like Adam and gradient-free methods like SPSA often depends on the specific problem, the PQC architecture, the available hardware, and the noise levels involved. Understanding these classical tools is the first step towards navigating the practicalities of building and training advanced quantum machine learning models.