Cost Functions for QML Tasks
In the Variational Quantum Algorithm (VQA) framework, the parameterized quantum circuit (PQC) prepares a quantum state ∣ψ(θ)⟩, where θ represents the classical parameters we aim to optimize. However, to guide this optimization, we need a way to quantify how well the state ∣ψ(θ)⟩ achieves the target machine learning objective. This is the role of the cost function, C(θ). It acts as the essential bridge between the quantum processing unit (QPU) and the classical optimization routine, translating quantum measurement outcomes into a scalar value that indicates performance.
From Quantum Measurements to Classical Costs
At the core of a VQA's evaluation step lies quantum measurement. Typically, we measure one or more qubits in a specific basis (often the computational basis, Z-basis) after the state ∣ψ(θ)⟩ has been prepared. This measurement process is inherently probabilistic, yielding classical bitstrings as outcomes.
To form a differentiable cost function suitable for optimization, we usually don't work directly with the raw probabilities of these bitstrings. Instead, we compute the expectation value of a chosen observable, which is a Hermitian operator O^. For a given state ∣ψ(θ)⟩ prepared by the PQC with parameters θ, the expectation value is calculated as:
$$\langle \hat{O} \rangle_{\theta} = \langle \psi(\theta) | \hat{O} | \psi(\theta) \rangle$$
This expectation value ⟨O^⟩θ provides a smooth, real-valued output that depends on the circuit parameters θ. It represents the average value we would obtain if we measured the observable O^ on the state ∣ψ(θ)⟩ many times. This expectation value, or a function derived from it and the target data labels, forms the basis of our cost function C(θ).
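To make this concrete, here is a minimal sketch, assuming PennyLane and its default.qubit simulator, that prepares a small two-qubit state with an illustrative circuit and returns the expectation value of the Pauli Z operator on qubit 0; the gate choices and parameter values are arbitrary.

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def expectation(theta):
    # A toy PQC: parameterized single-qubit rotations plus one entangling gate.
    qml.RY(theta[0], wires=0)
    qml.RY(theta[1], wires=1)
    qml.CNOT(wires=[0, 1])
    # Observable O = Z_0; the QNode returns <psi(theta)| Z_0 |psi(theta)>.
    return qml.expval(qml.PauliZ(0))

theta = np.array([0.4, 0.8], requires_grad=True)
print(expectation(theta))  # a smooth, real-valued function of theta
```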
The overall process can be visualized as follows:
The VQA loop involves preparing a state with the PQC using current parameters θ, measuring an observable O^ to estimate its expectation value ⟨O^⟩θ, calculating the cost C(θ) by comparing this value to the target, and feeding the cost back to a classical optimizer to propose updated parameters.
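Continuing the sketch above, one way to close this loop is gradient descent on a simple squared-error cost; the target value, step size, and number of iterations below are illustrative assumptions rather than recommended settings.

```python
# Classical optimizer that proposes updated parameters from the evaluated cost.
opt = qml.GradientDescentOptimizer(stepsize=0.2)
target = -1.0  # assumed target value for <Z_0>

def cost(theta):
    # Compare the estimated expectation value to the target.
    return (expectation(theta) - target) ** 2

theta = np.array([0.4, 0.8], requires_grad=True)
for step in range(50):
    theta = opt.step(cost, theta)  # quantum evaluation + classical parameter update

print(cost(theta), theta)
```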
Designing Cost Functions for Specific ML Tasks
The exact form of C(θ) depends significantly on the machine learning task at hand. Let's look at common scenarios:
Classification
In classification tasks, the goal is to assign input data xi to one of several discrete categories.
Binary Classification: For problems with two classes (e.g., labeled +1 and -1), a common approach is to design the PQC and observable O^ such that the expectation value ⟨O^⟩θ(i) (computed for input xi) correlates with the class label yi. For instance, we might measure the Pauli Z operator Z^0 on the first qubit, aiming for ⟨Z^0⟩θ(i)≈+1 for one class and ⟨Z^0⟩θ(i)≈−1 for the other.
Mean Squared Error (MSE): A straightforward cost function is the MSE between the predicted expectation value and the target label:
$$C_{\mathrm{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \langle \hat{O} \rangle_{\theta}^{(i)} - y_i \right)^2$$
Here, N is the number of training samples. While simple, MSE penalizes deviations quadratically, which might not always be the most effective loss for classification.
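As an illustration, the sketch below builds a toy binary classifier in PennyLane and evaluates the MSE cost over a small dataset; the embedding, ansatz, toy data, and parameter shapes are all illustrative choices, not a fixed recipe.

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def classifier(x, theta):
    qml.AngleEmbedding(x, wires=[0, 1])             # encode the input x_i
    qml.BasicEntanglerLayers(theta, wires=[0, 1])   # trainable layers
    return qml.expval(qml.PauliZ(0))                # prediction in [-1, +1]

def mse_cost(theta, X, y):
    preds = np.array([classifier(x, theta) for x in X])
    return np.mean((preds - y) ** 2)

X = np.array([[0.1, 0.5], [1.2, 0.3]])              # toy inputs
y = np.array([+1.0, -1.0])                          # labels in {-1, +1}
theta = np.random.uniform(0, np.pi, size=(2, 2))    # shape (n_layers, n_wires)
print(mse_cost(theta, X, y))
```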
Hinge Loss: Inspired by Support Vector Machines (SVMs), the hinge loss encourages the expectation value to be correctly signed and above a certain margin:
$$C_{\mathrm{Hinge}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; 1 - y_i \langle \hat{O} \rangle_{\theta}^{(i)}\right)$$
This loss is zero when the prediction ⟨O^⟩θ(i) has the correct sign and lies beyond the margin (yi⟨O^⟩θ(i)≥1), and it increases linearly otherwise.
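Reusing the classifier sketch above, a hinge-loss version of the cost differs only in how the predictions are compared to the labels:

```python
def hinge_cost(theta, X, y):
    preds = np.array([classifier(x, theta) for x in X])
    # Zero penalty once y_i * prediction >= 1; linear penalty otherwise.
    return np.mean(np.maximum(0.0, 1.0 - y * preds))
```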
Cross-Entropy Loss: This loss is common in classical ML for probabilistic classifiers. It can be adapted if we interpret measurement outcomes as probabilities. For example, if we measure qubit 0 and estimate the probability p(+1∣θ,xi) of measuring ∣0⟩ (associated with label yi=1) and p(−1∣θ,xi) of measuring ∣1⟩ (associated with label yi=0), the binary cross-entropy is:
$$C_{\mathrm{XEnt}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log p(+1 \mid \theta, x_i) + (1 - y_i) \log p(-1 \mid \theta, x_i) \,\right]$$
Using cross-entropy requires careful estimation of probabilities from measurement counts, which can be sample-intensive. It often requires mapping the binary labels {0,1} or {−1,+1} appropriately to the probabilities derived from specific measurement outcomes.
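A possible cross-entropy sketch, again reusing the circuit structure above, reads out the probabilities of qubit 0 directly; the mapping of ∣0⟩ to class 1, the labels in {0, 1}, and the small epsilon added for numerical stability are illustrative choices.

```python
@qml.qnode(dev)
def class_probs(x, theta):
    qml.AngleEmbedding(x, wires=[0, 1])
    qml.BasicEntanglerLayers(theta, wires=[0, 1])
    return qml.probs(wires=0)  # [p(|0>), p(|1>)] for qubit 0

def cross_entropy_cost(theta, X, y, eps=1e-8):
    loss = 0.0
    for x_i, y_i in zip(X, y):              # y_i in {0, 1}
        p0, p1 = class_probs(x_i, theta)
        loss -= y_i * np.log(p0 + eps) + (1 - y_i) * np.log(p1 + eps)
    return loss / len(X)
```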
Multi-class Classification: Extending these ideas to more than two classes typically involves using multiple output qubits, different measurement strategies (e.g., measuring multiple Pauli operators), or combining binary classifiers. The cost function needs to be adapted accordingly, often using multi-class versions of MSE or cross-entropy.
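One possible multi-class extension, sketched below under the same assumptions, reads out the joint probabilities of two qubits as four class probabilities and applies a categorical cross-entropy; this is only one of the strategies mentioned above.

```python
@qml.qnode(dev)
def multiclass_probs(x, theta):
    qml.AngleEmbedding(x, wires=[0, 1])
    qml.BasicEntanglerLayers(theta, wires=[0, 1])
    return qml.probs(wires=[0, 1])          # probabilities of |00>, |01>, |10>, |11>

def multiclass_cost(theta, X, labels, eps=1e-8):
    # labels are integer class indices in {0, 1, 2, 3}
    loss = 0.0
    for x_i, c_i in zip(X, labels):
        p = multiclass_probs(x_i, theta)
        loss -= np.log(p[c_i] + eps)
    return loss / len(X)
```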
Regression
In regression, the goal is to predict a continuous value yi for an input xi.
The expectation value ⟨O^⟩θ(i) itself can serve as the predicted continuous output. The scaling and range of the observable O^ should ideally match the expected range of the target values yi, or the output needs to be rescaled.
Mean Squared Error (MSE): This is the most common cost function for regression in VQAs:
$$C_{\mathrm{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \langle \hat{O} \rangle_{\theta}^{(i)} - y_i \right)^2$$
It directly penalizes the squared difference between the quantum model's prediction and the true continuous value.
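If the targets do not lie in [−1, +1], one simple option, reusing the classifier sketch from the classification section as a regressor, is to rescale the expectation value before applying the MSE; the target range below is an assumed example.

```python
y_min, y_max = 0.0, 10.0  # assumed range of the regression targets

def regression_cost(theta, X, y):
    preds = np.array([classifier(x, theta) for x in X])        # values in [-1, +1]
    rescaled = (preds + 1.0) / 2.0 * (y_max - y_min) + y_min   # map to [y_min, y_max]
    return np.mean((rescaled - y) ** 2)
```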
Generative Modeling
Cost functions for generative tasks, such as learning probability distributions with Quantum Circuit Born Machines (QCBMs) or training Quantum Generative Adversarial Networks (QGANs), are distinct. They often involve metrics that compare the probability distribution produced by the quantum circuit to the target data distribution (e.g., Maximum Mean Discrepancy, Kullback-Leibler divergence estimates) or adversarial losses derived from a discriminator network. These will be explored in more detail in Chapter 6.
Choosing the Right Observable
The choice of the observable O^ is not arbitrary; it's an integral part of the VQA design. It determines precisely what property of the final quantum state ∣ψ(θ)⟩ is extracted to make the prediction. Common choices include:
Single-Qubit Pauli Operators: Z^k, X^k, or Y^k acting on a specific qubit k. Measuring Z^k is natural as it corresponds to measurement in the computational basis.
Multi-Qubit Pauli Strings: Tensor products like Z^0⊗Z^1 or X^0⊗Y^1. These capture correlations between qubits.
Weighted Sums of Pauli Strings (Hamiltonians): More complex observables, often motivated by the problem structure, such as $\hat{H} = \sum_k c_k \hat{P}_k$, where the $\hat{P}_k$ are Pauli strings and the $c_k$ are real coefficients.
The observable should be chosen based on how the PQC encodes information. If the final answer is expected to be encoded in the polarization of a specific qubit, measuring a Pauli operator on that qubit makes sense. If correlations are important, a multi-qubit observable might be necessary. The range of the expectation value (e.g., [−1,+1] for Pauli operators) should also be considered when relating it to target labels or values.
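In PennyLane, for example, these three kinds of observables can be written as follows (a sketch; the coefficients and Pauli strings are arbitrary):

```python
import pennylane as qml

single_qubit = qml.PauliZ(0)                        # Z on qubit 0
pauli_string = qml.PauliZ(0) @ qml.PauliZ(1)        # correlation Z_0 ⊗ Z_1
hamiltonian = qml.Hamiltonian(
    [0.5, -0.3],                                    # coefficients c_k
    [qml.PauliZ(0), qml.PauliX(0) @ qml.PauliX(1)]  # Pauli strings P_k
)
# Any of these can be passed to qml.expval(...) inside a QNode.
```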
Practical Estimation of Expectation Values
It's important to remember that on any real quantum computer (or simulator mimicking one), we cannot access the exact expectation value ⟨O^⟩θ. Instead, we estimate it by preparing the state ∣ψ(θ)⟩ and measuring the observable O^ repeatedly, say Nshots times.
If O^ is diagonal in the measurement basis (like Z^ in the computational basis), we count the frequencies of the outcomes corresponding to its eigenvalues.
If O^ is not diagonal (like X^ or Y^), we need to apply appropriate basis change gates before measurement or decompose O^ into a sum of simpler observables (like Pauli strings) that can be measured individually.
The accuracy of this estimation depends on the number of shots, with the standard deviation of the estimate typically scaling as 1/√Nshots. This inherent statistical noise, known as "shot noise," means the cost function evaluation itself is noisy, adding a layer of challenge to the optimization process compared to classical ML, where function evaluations are typically deterministic.
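The following sketch illustrates shot noise by evaluating the same toy circuit from earlier on a finite-shot simulator; the shot count and number of repetitions are arbitrary.

```python
import pennylane as qml
from pennylane import numpy as np

dev_shots = qml.device("default.qubit", wires=2, shots=1000)

@qml.qnode(dev_shots)
def noisy_expectation(theta):
    qml.RY(theta[0], wires=0)
    qml.RY(theta[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))  # estimated from 1000 samples

theta = np.array([0.4, 0.8])
estimates = [noisy_expectation(theta) for _ in range(20)]
# The estimates fluctuate around the exact value, with spread ~ 1/sqrt(N_shots).
print(np.mean(estimates), np.std(estimates))
```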
Impact on Optimization
The cost function C(θ) defines the optimization landscape that the classical optimizer navigates to find the best parameters θ. The structure of this landscape, influenced by the PQC architecture, the data encoding, the choice of observable, and the specific cost function formula, determines the feasibility and efficiency of training the VQA. Issues like the presence of many local minima or the phenomenon of barren plateaus (regions where gradients vanish exponentially with the number of qubits) are directly related to the properties of C(θ) and its gradients. Understanding how to define effective cost functions is therefore fundamental to building successful VQAs for machine learning. The subsequent sections on gradient calculation and optimization techniques will build directly upon this foundation.