The classical Bellman optimality operator uses one-step lookahead:
$$ \mathcal{T} v = \max_{a} \left{ r(s,a) + \gamma \mathbb{E}[v(s') \mid s,a] \right}. $$
This operator is a
This raises a natural question: might we obtain better convergence properties by modifying the Bellman operator itself? Rather than one-step lookahead, consider multi-step operators:
$$ \mathcal{T}^{(n)} v = \max_{a_0} \mathbb{E}\left[ \sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n v(s_n) ,\Big|, s_0 = s, a_0 \right], $$
where the expectation is over trajectories following optimal actions at states
For policy evaluation with a fixed policy
$$ \mathcal{T}_\pi^{(n)} v = \mathbb{E}\pi\left[ \sum{t=0}^{n-1} \gamma^t r_t + \gamma^n v(s_n) ,\Big|, s_0 = s \right]. $$
These multi-step operators define different fixed-point equations. Just as $v^* = \mathcal{T} v^$, we have $v^ = \mathcal{T}^{(n)} v^*$ for any
Rather than committing to a single horizon
The weight
Why geometric weights? Several justifications:
-
Maximum entropy: Among all distributions on
${1, 2, 3, \ldots}$ with a given mean, the geometric distribution has maximum entropy. This makes it the "least informative" choice given a desired average depth. -
Memoryless property: The geometric distribution is memoryless:
$P(N > n+k \mid N > n) = P(N > k)$ . This time-consistency simplifies analysis. -
Resolvent interpretation: The series
$(1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \mathcal{T}^{(n)}$ resembles a Neumann series or resolvent expansion from functional analysis. -
Computational tractability: As we will see, geometric averaging enables recursive computation via auxiliary state variables (eligibility traces), avoiding explicit summation.
An equivalent perspective interprets
$$ \mathcal{T}^\lambda v = \mathbb{E}_\tau\left[ \mathcal{T}^{(\tau)} v \right]. $$
The operator averages over random-depth backups. This viewpoint appears in {cite}Watkins1989 and {cite}DayanSejnowski1994, motivated by biological considerations of memory traces in reinforcement learning. While the historical motivation was psychological, the mathematical structure stands on its own: we are simply considering a family of Bellman-like operators parameterized by
For the Bellman optimality operator, the multi-step backup must account for the max at each intermediate state:
$$ \mathcal{T}^{(n)} v = \max_{a_0} \mathbb{E}\left[ \sum_{t=0}^{n-1} \gamma^t r(s_t, a_t) + \gamma^n v(s_n) ,\Big|, s_0 = s, a_0, a_t = \arg\max_{a'} [r(s_t,a') + \gamma \mathbb{E}[v(s_{t+1}) \mid s_t, a']] \right]. $$
Each
$$ \mathcal{T}^\lambda v = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \max_{a_0} \mathbb{E}{\text{greedy}}\left[ \sum{t=0}^{n-1} \gamma^t r_t + \gamma^n v(s_n) ,\Big|, s_0=s, a_0 \right]. $$
Key observation: The max appears only at the initial action
Just as with the standard Bellman operator
Outer level (infinite-dimensional): We seek a solution to the functional equation
$$
v = \mathcal{T}^\lambda v.
$$
Since we cannot represent
Inner level (finite-dimensional): We parameterize
For different choices of test functions, we obtain different finite-dimensional conditions:
Collocation: Choose
Galerkin: Require orthogonality
Least squares: Minimize the weighted squared residual: $$ \min_\theta \int_{\mathcal{S}} [R^\lambda(s; \theta)]^2 w(s) ds = \min_\theta \left| v_\theta - \mathcal{T}^\lambda v_\theta \right|_w^2. $$
Each method reduces the infinite-dimensional functional equation to a system of
The most natural approach exploits the fixed-point structure. In function space, we would iterate: $$ v_{k+1} = \mathcal{T}^\lambda v_k. $$
But since we work in a finite-dimensional subspace, each iterate must be projected back. This gives projected fixed-point iteration:
$$
v_{k+1} = \Pi \mathcal{T}^\lambda v_k,
$$
where
In coefficient space, this becomes a two-step procedure:
Step 1 (Evaluate): Given current coefficients
Step 2 (Fit): Find new coefficients
The specifics of Step 2 depend on the projection method:
-
Collocation: Solve
$\boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}$ where$\Phi_{ij} = \varphi_j(s_i)$ -
Galerkin: Solve
$M\theta^{(k+1)} = b^{(k)}$ where$M_{ij} = \langle \varphi_i, \varphi_j \rangle_w$ and$b_i^{(k)} = \langle t^{(k)}, \varphi_i \rangle_w$ -
Least squares: Solve
$\min_\theta \sum_i w_i (t_i^{(k)} - \boldsymbol{\varphi}(s_i)^\top \theta)^2$
This is fitted value iteration with the
:label: fitted-vi-lambda-collocation
**Input** Collocation points $\{s_1, \ldots, s_n\}$, basis functions $\{\varphi_1, \ldots, \varphi_n\}$, compound parameter $\lambda \in [0,1)$, initial $\theta^{(0)}$, tolerance $\varepsilon > 0$
**Output** Converged coefficients $\theta^*$
1. Form collocation matrix $\boldsymbol{\Phi}$ with $\Phi_{ij} = \varphi_j(s_i)$
2. $k \leftarrow 0$
3. **repeat**
1. **for** each $i = 1, \ldots, n$ **do** (Evaluate $\mathcal{T}^\lambda v_{\theta^{(k)}}$ at collocation points)
1. $t_i^{(k)} \leftarrow [\mathcal{T}^\lambda v_{\theta^{(k)}}](s_i)$
(This is $(1-\lambda)\sum_{m=1}^{\infty} \lambda^{m-1} [\mathcal{T}^{(m)} v_{\theta^{(k)}}](s_i)$, computed as described below)
2. Solve $\boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}$ (Fit: interpolate the targets)
3. $k \leftarrow k + 1$
4. **until** $\|\theta^{(k)} - \theta^{(k-1)}\| < \varepsilon$
5. **return** $\theta^{(k)}$
Key observation: This algorithm has the same structure as standard fitted value iteration, but uses
For Galerkin projection, the algorithm structure is similar, but Step 3.2 requires solving a weighted least-squares problem rather than exact interpolation:
:label: fitted-vi-lambda-galerkin
**Input** Basis functions $\{\varphi_1, \ldots, \varphi_n\}$, weight function $w(s)$, compound parameter $\lambda \in [0,1)$, initial $\theta^{(0)}$, tolerance $\varepsilon > 0$
**Output** Converged coefficients $\theta^*$
1. Compute mass matrix $M_{ij} = \int_{\mathcal{S}} \varphi_i(s) \varphi_j(s) w(s) ds$
2. $k \leftarrow 0$
3. **repeat**
1. **for** each $i = 1, \ldots, n$ **do** (Evaluate projected $\mathcal{T}^\lambda v_{\theta^{(k)}}$)
1. $b_i^{(k)} \leftarrow \int_{\mathcal{S}} [\mathcal{T}^\lambda v_{\theta^{(k)}}](s) \varphi_i(s) w(s) ds$
2. Solve $M \theta^{(k+1)} = b^{(k)}$ (Fit: Galerkin projection)
3. $k \leftarrow k + 1$
4. **until** $\|\theta^{(k)} - \theta^{(k-1)}\| < \varepsilon$
5. **return** $\theta^{(k)}$
In both algorithms, convergence depends on whether
Alternatively, we can treat the projection conditions as a rootfinding problem
- Collocation: $G_i(\theta) = v_{\theta}(s_i) - \mathcal{T}^\lambda v_\theta$
-
Galerkin:
$G_i(\theta) = \langle v_\theta - \mathcal{T}^\lambda v_\theta, \varphi_i \rangle_w$
Newton's method then iterates:
$$
\theta^{(k+1)} = \theta^{(k)} - J_G(\theta^{(k)})^{-1} G(\theta^{(k)}),
$$
requiring computation of the Jacobian
The motivation is entirely about the inner level (finite-dimensional) convergence properties. The standard theory guarantees that
- Monotone projections preserve contraction in
$|\cdot|_\infty$ - Orthogonal projections preserve contraction in
$|\cdot|_\xi$ for policy evaluation when$\xi$ is the stationary distribution - For general projections (e.g., Galerkin with polynomial bases), neither guarantee applies
The compound operator
Convergence question: Under what conditions does the successive approximation
$$
\theta^{(k+1)} = [\text{fit to } {(s_i, \mathcal{T}^\lambda v_{\theta^{(k)}})}]
$$
converge? This requires
- For tabular representations (
$\Pi = I$ ),$\lambda$ makes no difference:$v^* = \mathcal{T}^\lambda v^*$ for any$\lambda$ - For policy evaluation with linear function approximation and on-policy sampling, TD(
$\lambda$ ) converges for all$\lambda \in [0,1]$ {cite}TsitsiklisVanRoy1997 - For Bellman optimality with general projections, convergence is not guaranteed even with
$\lambda$ . Divergence examples exist {cite}Baird1995
The weighted residual framework makes clear what we are solving: the fixed-point equation
Computing $\mathcal{T}^\lambda v$: The Role of Recursive Filtering
Having established the two-level iteration structure, we now address a critical implementation question: How do we compute $\mathcal{T}^\lambda v_\theta$ in Step 3.1 of the fitted value iteration algorithms?
The definition
Important conceptual point: The recursive filtering machinery described below is purely about implementing the operator evaluation $\mathcal{T}^\lambda v$ efficiently. It does not change the two-level structure of fitted value iteration. Whether we compute $\mathcal{T}^\lambda v$ via the explicit sum, via recursion, or via any other method, the outer iterate
This is analogous to numerical quadrature: when fitted value iteration with
Consider the
The
We can expand this sum:
\begin{align} G^\lambda_t &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \left( \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n v_\theta(s_{t+n}) \right) \ &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \sum_{k=0}^{n-1} \gamma^k r_{t+k} + (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \gamma^n v_\theta(s_{t+n}). \end{align}
The second term simplifies:
Telescoping approach: Write $G^{(n+1)}t = r_t + \gamma G^{(n)}{t+1}$. Then:
\begin{align} G^\lambda_t &= (1-\lambda) G^{(1)}t + (1-\lambda) \sum{n=1}^{\infty} \lambda^n G^{(n+1)}t \ &= (1-\lambda) [r_t + \gamma v\theta(s_{t+1})] + (1-\lambda) \sum_{n=1}^{\infty} \lambda^n [r_t + \gamma G^{(n)}{t+1}] \ &= (1-\lambda) r_t + (1-\lambda)\lambda \sum{n=1}^{\infty} \lambda^{n-1} [r_t + \gamma G^{(n)}{t+1}] + (1-\lambda)\gamma v\theta(s_{t+1}) \ &= r_t + \gamma \left[ (1-\lambda) v_\theta(s_{t+1}) + \lambda G^\lambda_{t+1} \right]. \end{align}
This gives the recursion:
Interpretation: The
We can view this recursion as a linear time-invariant (LTI) dynamical system. Define the auxiliary variable (eligibility trace):
where
The trace satisfies a first-order recursion:
This is the backward view: we maintain
The gradient update for the
State-space form: Define state
\begin{align} z_{t+1} &= \lambda\gamma z_t + u_t, \ y_t &= \delta_t z_t. \end{align}
This is a first-order LTI system with state transition matrix
Alternatively, we can view the eligibility trace computation as a discrete convolution. The trace at time
where
This filter has a single pole at
We can now see the complete hierarchy of abstraction in weighted residual methods with compound operators:
Level 1 (Outer/Functional):
Level 2 (Inner/Parametric):
After choosing a finite-dimensional approximation
-
Strategy A (Successive Approximation): $$\theta^{(k+1)} = \arg\min_\theta \sum_i w_i \left( v_\theta(s_i) - \mathcal{T}^\lambda v_{\theta^{(k)}} \right)^2$$ This is fitted value iteration: evaluate operator at current parameters, fit new parameters to targets. Converges when
$\Pi \mathcal{T}^\lambda$ is a contraction. -
Strategy B (Newton's Method): Treat
$G(\theta) = v_\theta - \Pi \mathcal{T}^\lambda v_\theta = 0$ as a rootfinding problem and iterate:$$\theta^{(k+1)} = \theta^{(k)} - J_G(\theta^{(k)})^{-1} G(\theta^{(k)})$$ This is equivalent to policy iteration in the Q-function case. Requires computing Jacobians (Envelope Theorem for optimization operators).
Level 3 (Implementation/Computational): How do we compute $\mathcal{T}^\lambda v_{\theta^{(k)}}$ in the evaluation step?
-
If we have exact MDP knowledge: Compute expectations via dynamic programming: $$\mathcal{T}^{(n)} v = \max_a \left{ r(s,a) + \gamma \sum_{s'} p(s'|s,a) \mathcal{T}^{(n-1)} v \right}$$ Then form the geometric average $(1-\lambda)\sum_{n=1}^{N} \lambda^{n-1} \mathcal{T}^{(n)} v$ by truncating at large
$N$ . -
If we only have a simulator: Sample trajectories from
$s$ , compute returns$G^\lambda_t$ along trajectories, and average. The eligibility trace recursion provides memory-efficient computation:$$z_{t+1} = \lambda\gamma z_t + \nabla v_\theta(s_t), \quad \nabla_\theta L = \mathbb{E}\left[\sum_t \delta_t z_t\right]$$
The three levels are cleanly separated:
- Level 1: What equation are we solving? (functional equation, operator choice)
- Level 2: How do we reduce it to finite dimensions and solve the resulting system? (projection method, successive approximation vs rootfinding)
- Level 3: How do we evaluate the operator given our computational constraints? (exact DP, Monte Carlo, eligibility traces)
The eligibility trace filter lives entirely at Level 3—it is a computational device for Monte Carlo estimation, analogous to choosing Gauss-Hermite quadrature vs Monte Carlo for computing
Why this hierarchy matters:
-
Conceptual clarity: We can reason about convergence at Level 2 (does
$\Pi \mathcal{T}^\lambda$ contract?) independently of Level 3 concerns (how do we compute it?). - Modularity: We can swap Level 3 implementations (exact DP ↔ Monte Carlo ↔ importance sampling) without changing Levels 1-2.
- Avoiding confusion: Eligibility traces are often presented as if they define the algorithm. In fact, they're an implementation optimization for a particular case (Monte Carlo estimation at Level 3).
To make the two-level structure concrete, consider policy evaluation with a fixed policy
Outer level: We seek
Inner level (Galerkin projection): We parameterize as
Expanding this condition: $$ \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} \boldsymbol{\theta} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathcal{T}^\lambda_\pi v_\theta). $$
For successive approximation (fitted value iteration at Level 2), we iterate: $$ \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathcal{T}^\lambda_\pi v_{\theta^{(k)}}). $$
This is precisely the LSTD(
Level 3 (Implementation): How do we compute the right-hand side
-
With exact MDP knowledge: Sum over states and compute expectations: $$[\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathcal{T}^\lambda_\pi v)]_i = \sum_s \xi(s) \varphi_i(s) \mathcal{T}^\lambda_\pi v$$
-
With only a simulator: Sample trajectories following
$\pi$ , compute$\lambda$ -returns along each trajectory, and use eligibility traces to efficiently accumulate the sum$\sum_t \delta_t z_t$ where$z_t$ is the filtered gradient history.
The two-level structure (outer functional equation, inner coefficient iteration) is independent of whether we have a model or only samples. The sampling concern is purely at Level 3.
For the Bellman optimality operator with linear Q-function approximation
Outer level: We seek $Q^$ satisfying $Q^ = \mathcal{T}^\lambda_Q Q^*$, where:
$$
\mathcal{T}^{(n)}_Q Q = \mathbb{E}\left[ \sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n \max_{a'} Q(s_n, a') ,\Big|, s_0=s, a_0=a, \pi^Q_{\text{greedy}} \text{ thereafter} \right],
$$
and
Inner level (Collocation): Choose state-action pairs ${(s_i, a_i)}{i=1}^n$ and require: $$ \boldsymbol{\varphi}(s_i, a_i)^\top \boldsymbol{\theta} = [\mathcal{T}^\lambda_Q Q\theta](s_i, a_i), \quad i = 1, \ldots, n. $$
Successive approximation (fitted Q-iteration at Level 2):
$$
\boldsymbol{\Phi} \boldsymbol{\theta}^{(k+1)} = t^{(k)}, \quad t_i^{(k)} = [\mathcal{T}^\lambda_Q Q_{\theta^{(k)}}](s_i, a_i),
$$
where
Level 3 (Implementation): Computing $\mathcal{T}^\lambda_Q Q$ requires:
-
Multi-step returns with greedy action selection: For each
$n$ , compute $$\mathcal{T}^{(n)}_Q Q = \mathbb{E}\left[\sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n \max_{a'} Q(s_n, a') ,\Big|, s_0=s, a_0=a, \text{greedy thereafter}\right]$$ -
Geometric averaging: $\mathcal{T}^\lambda_Q Q = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \mathcal{T}^{(n)}_Q Q$
With a simulator, we sample trajectories and use eligibility traces to efficiently compute returns. The eligibility trace for Q-learning is: $$ z_t = \sum_{k=0}^{t} (\lambda\gamma)^{t-k} \mathbb{I}[a_k \text{ is greedy}] \nabla Q_\theta(s_k, a_k). $$
The indicator Watkins1989.
Why the cut-off? The multi-step operator
Peng's Q(Peng1993 omits the cut-off:
$$
z_t = \sum_{k=0}^{t} (\lambda\gamma)^{t-k} \nabla Q_\theta(s_k, a_k).
$$
This computes
The fundamental question remains: for which projection operators
- For policy evaluation with linear function approximation and on-policy sampling, convergence holds for all
$\lambda \in [0,1]$ {cite}TsitsiklisVanRoy1997 - For optimality operators, even with linear approximation, convergence is not guaranteed. Divergence examples exist {cite}
Baird1995
A complete theory would characterize:
- When does
$\lambda > 0$ improve contraction rate compared to$\lambda = 0$ ? - Is there an optimal choice of
$\lambda$ for a given projection operator and MDP? - Can we adaptively adjust
$\lambda$ during learning to improve stability?
For policy evaluation with linear function approximation, the iteration
Key question: How does
For Galerkin projection with policy
The compound operator satisfies: \begin{align} A_\lambda &= \Pi_\xi \mathcal{T}^\lambda_\pi = \Pi_\xi \left[(1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \mathcal{T}^n_\pi\right] \ &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} (\Pi_\xi \mathcal{T}\pi)^n \quad \text{(if projection and operator commute)} \ &= (1-\lambda) \sum{n=1}^{\infty} \lambda^{n-1} A_0^n. \end{align}
Problem: Projection and operator generally do not commute: $\Pi_\xi \mathcal{T}\pi \mathcal{T}\pi \neq \Pi_\xi \mathcal{T}\pi \cdot \Pi\xi \mathcal{T}\pi$ in general. The exact relationship between eigenvalues of $A\lambda$ and
However, we can derive bounds. If
Empirical observations from the reinforcement learning literature:
- Intermediate values
$\lambda \in [0.8, 0.95]$ often work best in practice - Very small
$\lambda$ gives high variance (short bootstraps) -
$\lambda = 1$ (Monte Carlo) has zero bias but maximum variance - The optimal
$\lambda$ is problem-dependent and relates to the eigenvalue structure
A complete theory characterizing
The compound operator has a closed-form expression as a Neumann series sum. For policy evaluation (linear operator), assuming
This is a resolvent-like operator. In functional analysis, the resolvent
Preconditioning interpretation: In iterative methods for linear systems
The analogy is imperfect because we're not solving a linear system directly—we're finding a fixed point. But the intuition carries over: we hope that $(1-\lambda)\mathcal{T}\pi(I - \lambda\mathcal{T}\pi)^{-1}$ has better contraction properties than
Connection to relaxation methods: In numerical linear algebra, successive over-relaxation (SOR) accelerates iterative solvers by using updates of the form
Another perspective on why
Consider the residual at iteration
The error propagates through the transition operator
With the compound operator: $$ e^{(k+1)} = v^* - \Pi\mathcal{T}^\lambda v^{(k)} = v^* - \Pi (1-\lambda)\sum_n \lambda^{n-1} \mathcal{T}^n v^{(k)}. $$
The error now depends on an exponentially weighted average of
This geometric intuition suggests
We used geometric weights
- Truncated sums:
$\mathcal{T}^{[N]} = N^{-1} \sum_{n=1}^{N} \mathcal{T}^{(n)}$ (uniform average up to horizon$N$ ) - Polynomial weightings:
$w_n = c n^{-\alpha}$ for heavy tails - Data-adaptive weightings: learn weights from observed variance
These lose the recursive structure but might have better theoretical properties. Implementing them requires storing trajectory segments, trading computation for potentially better approximation.
The state-space representation of eligibility traces connects to control theory. Eligibility is a controlled dynamical system where the control input is the gradient sequence. Tools from control theory might apply:
- Observability/controllability analysis
- Optimal filtering (Kalman-like) for noisy gradient estimates
- Stability analysis via Lyapunov functions
Can we design better "filters" than the exponential decay
would allow richer temporal dynamics. This increases state dimension but might better match the temporal structure of value propagation in specific MDPs.
The discrete-time geometric stopping naturally extends to continuous-time exponential stopping. Define an exponentially distributed stopping time
$$ \mathcal{T}^\lambda v = \mathbb{E}\tau\left[ \int_0^\tau e^{-\rho t} r(s_t, a_t) dt + e^{-\rho \tau} v(s\tau) \right], $$
where
This is a first-order linear ODE. Do the convergence properties differ in continuous time? Are there advantages to working with continuous formulations (e.g., for robot control with fast sampling rates)?
Having developed the theory, we now consider practical guidance for implementing weighted residual methods with compound operators.
Use multi-step operators ($\lambda \in (0.5, 0.95)$) when:
- The projected operator
$\Pi \mathcal{T}$ contracts slowly or not at all (high discount factor$\gamma$ , non-monotone projection) - You have access to trajectory data (simulation or real experience)
- The variance-bias tradeoff favors some bias reduction (noisy rewards, sparse features)
Stick with one-step (
- The projection is monotone (linear interpolation, state aggregation) and
$\gamma$ is moderate - You only have single-transition samples (no trajectories)
- Off-policy learning is required (multi-step methods complicate importance sampling)
Consider Monte Carlo (
- Episodes are short (low variance)
- Bootstrapping is unreliable (poor initial value function, very sparse features)
- Theoretical simplicity is valued (unbiased returns, no bootstrap error)
There is no universal optimal
Grid search: Try
Adaptive
- Start with large
$\lambda$ (more Monte Carlo) for exploration, decay toward small$\lambda$ (more bootstrapping) as value function improves - Use different
$\lambda$ for different states (learned via meta-gradient methods)
Cross-validation: For episodic tasks, use held-out episodes to measure prediction error at different
Theory-guided: If you can estimate the spectrum of
Model-based (Level 3: Exact DP):
- Compute $\mathcal{T}^{(n)} v$ recursively up to horizon
$N$ where$\lambda^N$ is negligible - Storage:
$O(|\mathcal{S}| \cdot N)$ for intermediate$n$ -step values - Useful when transitions are cheap to evaluate (tabular or low-dimensional problems with analytic models)
Model-free (Level 3: Monte Carlo + Eligibility):
- Sample trajectories, maintain eligibility trace
$z_t = \lambda\gamma z_{t-1} + \nabla v_\theta(s_t)$ - Storage:
$O(|\theta|)$ for the trace (constant per trajectory length) - Essential when only a simulator is available
- Implementation: Use trace accumulation for LSTD(
$\lambda$ ) or online TD($\lambda$ )
Hybrid:
- Use short exact DP for
$n \leq n_0$ (cheap), then Monte Carlo with traces for$n > n_0$ (expensive) - Useful when partial model is available (e.g., reward function known, transitions require simulation)
Recall that at Level 2 (parametric/coefficient space), we can use either:
Successive approximation (fitted value iteration):
- Simple to implement: evaluate operator, fit, repeat
- Requires
$\Pi \mathcal{T}^\lambda$ to be a contraction - Works well with monotone projections or on-policy Galerkin
Newton's method (policy iteration):
- Faster convergence near solution (quadratic vs. linear)
- Requires Jacobian computation (Envelope Theorem for max operators)
- More robust to weak contraction or non-monotone projections
- Equivalent to policy iteration for Q-functions
Hybrid strategy:
- Run a few iterations of fitted value iteration to get into the basin of attraction
- Switch to Newton's method for rapid final convergence
- Useful when contraction is weak (
$\gamma$ close to 1) or projection is non-monotone
When implementing TD(
Level 1 (Operator choice):
- Are you solving policy evaluation (
$\mathcal{T}_\pi$ ) or optimality ($\mathcal{T}_Q$ )? - What is the target operator? (standard one-step or
$\lambda$ -compound) - For Q(
$\lambda$ ): Watkins (greedy trajectory) or Peng (behavior trajectory)?
Level 2 (Projection and solution method):
- What is the approximation architecture? (linear features, neural network, etc.)
- What projection method? (collocation, Galerkin, least-squares)
- What solution strategy? (successive approximation, Newton, stochastic gradient)
- Do you have a convergence guarantee? (monotone projection, matched weighting, etc.)
Level 3 (Evaluation implementation):
- Model-based or model-free?
- If model-free: full trajectories or single transitions?
- If using traces: correct decay rate
$\lambda\gamma$ ? Correct reset on episode boundaries? - For Q(
$\lambda$ ): correct greedy cutoff implementation?
We have developed compound Bellman operators
-
The three-level hierarchy cleanly separates:
- What equation we solve (functional equation, operator choice)
- How we reduce it to finite dimensions (projection, successive approximation vs. Newton)
- Implementation details (exact DP, Monte Carlo, eligibility traces)
-
Eligibility traces are not fundamental to the formulation—they are an efficient implementation technique (Level 3) for Monte Carlo estimation when we lack a model.
-
The forward/backward view distinction is purely computational (Level 3), not conceptual. Both views compute the same
$\lambda$ -return, just reorganized for different computational constraints. -
The motivation for
$\lambda$ is about improving finite-dimensional convergence (Level 2): even if$\Pi \mathcal{T}$ fails to contract,$\Pi \mathcal{T}^\lambda$ might contract for appropriate$\lambda$ . This is about the parametric iteration structure, not about psychological intuitions regarding memory traces. -
For Bellman optimality (Q(
$\lambda$ )), the multi-step operator naturally involves greedy action selection at intermediate states. The eligibility cutoff in Watkins's Q($\lambda$ ) follows directly from this definition, not from algorithmic considerations. -
Open theoretical questions remain about when
$\Pi \mathcal{T}^\lambda$ contracts, how to choose$\lambda$ optimally, and connections to resolvent theory and operator preconditioning.
This perspective integrates naturally with the main chapter's development:
- We previously asked: when does
$\Pi \mathcal{T}$ (composition of projection with Bellman operator) contract? - Now we ask: can we modify the operator via
$\lambda$ -averaging to improve contraction properties? - Both questions live at the Level 2 (parametric iteration) of the hierarchy
The machinery of weighted residual methods (test functions, orthogonality conditions, Galerkin vs. collocation) applies identically whether we use
We have extended the weighted residual framework to compound Bellman operators
| Level | What it specifies | Example for TD( |
|---|---|---|
| 1. Outer (Functional) | Which infinite-dimensional equation to solve |
|
| 2. Inner (Parametric) | How to reduce to finite dimensions and solve | Galerkin: Successive approx: |
| 3. Implementation | How to compute operator evaluations | Exact: $\mathcal{T}^\lambda v = (1-\lambda)\sum_n \lambda^{n-1}\mathcal{T}^n v$ Sampled: Eligibility traces |
| Algorithm | Level 1 (Operator) | Level 2 (Projection) | Level 3 (Evaluation) |
|---|---|---|---|
| Fitted VI | Collocation or Galerkin Successive approximation |
Exact DP or MC | |
| LSTD | $v_\pi = \mathcal{T}\pi v\pi$ | Galerkin with Closed-form solution |
Exact (model-based) |
| TD( |
Galerkin with Stochastic gradient |
MC + eligibility traces | |
| LSTD( |
Galerkin with Closed-form solution |
MC + trace-based matrix estimation | |
|
Q( |
Collocation Successive approximation |
MC + eligibility with greedy cutoff | |
|
Q( |
Collocation Successive approximation |
MC + eligibility (no cutoff) |
-
Motivation for
$\lambda$ : The parameter$\lambda$ provides an additional degree of freedom at Level 1. Even if$\Pi \mathcal{T}$ fails to contract, perhaps$\Pi \mathcal{T}^\lambda$ does for appropriate$\lambda$ . This is about improving finite-dimensional convergence (Level 2), not changing the underlying optimal value function (which satisfies$v^* = \mathcal{T}^\lambda v^*$ for any$\lambda$ ). -
Eligibility traces are Level 3: The forward/backward view distinction, state-space models, and filtering are purely about efficient Monte Carlo implementation. They don't change what equation we're solving (Level 1) or the structure of coefficient iteration (Level 2).
-
Watkins vs Peng: The difference is at Level 1 (which operator). Watkins solves the Bellman optimality equation with multi-step greedy backups. Peng solves the behavior policy equation. The eligibility cutoff follows from the operator definition, not from algorithmic considerations.
-
Convergence is a Level 2 question: Whether successive approximation
$\theta^{(k+1)} = \text{fit}[\mathcal{T}^\lambda v_{\theta^{(k)}}]$ converges depends on whether$\Pi \mathcal{T}^\lambda$ is a contraction. This is independent of whether we use exact DP or Monte Carlo (Level 3). The theory remains incomplete:- For policy evaluation with on-policy sampling and linear FA, TD(
$\lambda$ ) converges for all$\lambda \in [0,1]$ - For Bellman optimality with general projections, even with
$\lambda$ , divergence examples exist - The spectral properties of
$\Pi \mathcal{T}^\lambda$ and optimal choice of$\lambda$ are not fully understood
- For policy evaluation with on-policy sampling and linear FA, TD(
-
Connection to resolvent theory: The compound operator
$(1-\lambda)\sum_n \lambda^{n-1}\mathcal{T}^n = (1-\lambda)(I - \lambda\mathcal{T})^{-1}\mathcal{T}$ resembles a resolvent or preconditioned operator. Can ideas from numerical linear algebra (preconditioning to improve condition numbers) provide insight into when$\Pi \mathcal{T}^\lambda$ has better contraction properties than$\Pi \mathcal{T}$ ?
The traditional presentation of TD(
- Eligibility traces are presented as if they are the algorithm, obscuring that they're an implementation optimization
- The forward/backward view equivalence is emphasized, but this is purely a Level 3 computational identity
- The role of
$\lambda$ in improving projected operator contraction (Level 2 concern) is rarely made explicit
By separating the levels, we gain:
- Conceptual clarity: What problem are we solving vs. how are we solving it vs. how do we implement it?
- Modularity: Can swap implementations at each level independently
- Research directions: Can separately study operator properties (Level 1), contraction theory (Level 2), and efficient estimation (Level 3)
This perspective also extends naturally to modern function approximation (neural networks, kernels) and connects to classical numerical analysis (operator preconditioning, resolvent theory), avoiding the folklore narrative based on psychological intuitions about memory traces.