As we navigate the landscape of learning from fixed datasets, one of the early and intuitive approaches adapts familiar concepts from online learning. Fitted Q-Iteration (FQI) extends the core idea behind Q-learning and Value Iteration to the offline, batch setting. Instead of updating Q-values one transition at a time, FQI treats the problem as a sequence of supervised learning tasks, leveraging the entire available dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ at each step.
The fundamental goal remains the same: estimate the optimal action-value function $Q^*(s,a)$. FQI achieves this iteratively. We start with an initial Q-function estimate, often simply $\hat{Q}_0(s,a) = 0$ for all $s, a$. Then, for each iteration $k = 1, 2, \dots, K$:
Generate Targets: Using the Q-function estimate from the previous iteration, $\hat{Q}_{k-1}$, we compute target values $y_i$ for each state-action pair $(s_i, a_i)$ present in our static dataset $\mathcal{D}$. The target is derived from the Bellman optimality equation, using the observed reward $r_i$ and the estimated maximum value achievable from the next state $s'_i$:
$$y_i = r_i + \gamma \max_{a'} \hat{Q}_{k-1}(s'_i, a')$$

This creates a target dataset $\{((s_i, a_i), y_i)\}_{i=1}^{N}$. Note that the target $y_i$ uses the fixed data $(r_i, s'_i)$ but relies on the previous Q-function estimate $\hat{Q}_{k-1}$.
Supervised Learning: We train a function approximator (our new Q-function estimate $\hat{Q}_k$) to minimize the difference between its predictions $\hat{Q}_k(s_i, a_i)$ and the calculated targets $y_i$. This is effectively a standard regression problem over the dataset pairs generated in step 1. The objective is typically the Mean Squared Error (MSE):
$$\hat{Q}_k = \arg\min_{Q} \sum_{i=1}^{N} \left( Q(s_i, a_i) - y_i \right)^2$$

or, substituting the target definition,

$$\hat{Q}_k = \arg\min_{Q} \sum_{i=1}^{N} \left( Q(s_i, a_i) - \left( r_i + \gamma \max_{a'} \hat{Q}_{k-1}(s'_i, a') \right) \right)^2$$

Repeat: The newly trained $\hat{Q}_k$ becomes the input for generating targets in the next iteration ($k+1$). This process repeats for a predetermined number of iterations $K$, or until the changes in the Q-function estimates become sufficiently small.
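To make the loop concrete, here is a minimal sketch of FQI for a discrete action space, written against a generic scikit-learn-style regressor. The array layout, the `fitted_q_iteration` signature, and the `make_regressor` factory are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def fitted_q_iteration(states, actions, rewards, next_states, n_actions,
                       make_regressor, gamma=0.99, n_iterations=50):
    """Minimal FQI sketch for discrete actions (illustrative only).

    states:      (N, state_dim) array of s_i
    actions:     (N,) integer array of a_i
    rewards:     (N,) array of r_i
    next_states: (N, state_dim) array of s'_i
    """
    # Regression inputs are the (s_i, a_i) pairs; the action index is appended as a feature.
    X = np.hstack([states, actions.reshape(-1, 1)])
    q_model = None  # corresponds to the initial estimate Q_0 = 0

    for k in range(n_iterations):
        if q_model is None:
            max_next_q = np.zeros(len(rewards))  # Q_0 = 0, so the first targets are just rewards
        else:
            # Evaluate Q_{k-1}(s'_i, a') for every candidate action a' ...
            next_q = np.column_stack([
                q_model.predict(np.hstack([next_states,
                                           np.full((len(next_states), 1), a)]))
                for a in range(n_actions)
            ])
            # ... and take the max over actions (the step that can query OOD actions).
            max_next_q = next_q.max(axis=1)

        # Step 1: Bellman targets y_i = r_i + gamma * max_a' Q_{k-1}(s'_i, a')
        y = rewards + gamma * max_next_q

        # Step 2: fit a fresh regressor on ((s_i, a_i), y_i) to obtain Q_k
        q_model = make_regressor()
        q_model.fit(X, y)

    return q_model
```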
A significant aspect of FQI is its flexibility regarding the choice of function approximator for Q(s,a). Since each iteration boils down to a supervised regression task, any suitable regressor can be employed. Historically, non-parametric methods like tree ensembles (Random Forests, Gradient Boosting Trees) were often used due to their effectiveness in capturing complex functions without extensive hyperparameter tuning. However, deep neural networks can also be used, turning FQI into "Neural Fitted Q-Iteration" (NFQ), blurring the lines somewhat with DQN but maintaining the distinct iterative batch-fitting procedure.
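With the hypothetical `fitted_q_iteration` sketch above, switching approximators is just a different `make_regressor` argument; the data arrays are the same ones drawn from $\mathcal{D}$, and the hyperparameters below are arbitrary placeholders.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neural_network import MLPRegressor

# Tree-ensemble FQI: a robust default that needs little tuning
q_trees = fitted_q_iteration(states, actions, rewards, next_states, n_actions,
                             make_regressor=lambda: ExtraTreesRegressor(n_estimators=50))

# Neural Fitted Q-Iteration (NFQ): same loop, with a small MLP as the regressor
q_nfq = fitted_q_iteration(states, actions, rewards, next_states, n_actions,
                           make_regressor=lambda: MLPRegressor(hidden_layer_sizes=(64, 64),
                                                               max_iter=500))
```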
Despite its conceptual appeal, FQI often struggles in practice within the offline setting precisely because of distributional shift. The core issue lies in the target calculation step: $y_i = r_i + \gamma \max_{a'} \hat{Q}_{k-1}(s'_i, a')$.
The FQI process relies on estimating the maximum Q-value in the next state $s'_i$ using the previous Q-function estimate $\hat{Q}_{k-1}$. However, the action $a'$ yielding this maximum might be out-of-distribution relative to the actions present in the dataset $\mathcal{D}$ for state $s'_i$. Because the approximator was never fit on such state-action pairs, its value estimates there can be arbitrarily wrong, and these errors feed back into the target calculation and compound across iterations.
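A rough way to observe this empirically, assuming one-dimensional, hashable state identifiers and the hypothetical `q_model` interface from the earlier sketch, is to measure how often the greedy next action was never actually taken from that state anywhere in the dataset:

```python
import numpy as np

def ood_greedy_fraction(q_model, states, actions, next_states, n_actions):
    """Fraction of transitions whose greedy next action is unseen at that state in D.

    Diagnostic sketch only: assumes 1-D discrete state identifiers so that
    "observed in the data" can be checked by exact lookup.
    """
    # Record which actions the behavior policy actually took from each state.
    seen = {}
    for s, a in zip(states, actions):
        seen.setdefault(s, set()).add(a)

    ood = 0
    for s_next in next_states:
        # Greedy action a' = argmax_a Q(s_next, a) under the current estimate.
        q_values = [q_model.predict(np.array([[s_next, a]]))[0] for a in range(n_actions)]
        greedy_a = int(np.argmax(q_values))
        if greedy_a not in seen.get(s_next, set()):
            ood += 1  # the Bellman target bootstraps from an unobserved (s', a') pair

    return ood / len(next_states)
```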
FQI, therefore, represents an important baseline but fundamentally lacks mechanisms to explicitly combat the distributional shift problem. It operates under the implicit, often incorrect, assumption that the function approximator will generalize reasonably well even to state-action pairs unseen or rare in the data. This limitation motivates the development of methods specifically designed for the offline setting, such as the policy constraint and value regularization techniques we will discuss next, which aim to mitigate the risks associated with querying OOD actions.