[Editing] Most deep RL algorithms are TD-based.
Updates (a code sketch follows the list):
Monte Carlo: $S_t \mapsto G_t$
TD(0): $S_t \mapsto R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$
$n$-step TD: $S_t \mapsto G_{t:t+n}$
DP: $s \mapsto \mathbb{E}_\pi \left[ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) \mid S_t = s \right]$
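As an illustration (not from the original text), here is a minimal sketch of how the Monte Carlo and TD(0) targets plug into the same semi-gradient update, assuming a linear approximator $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$; the helper names and parameters are hypothetical.

```python
import numpy as np

# Minimal sketch, assuming a linear approximator v_hat(s, w) = w @ x(s).
# Feature vectors, rewards, and step sizes below are hypothetical placeholders.

def v_hat(w, x):
    return w @ x                              # linear value estimate

def semi_gradient_update(w, x, target, alpha=0.1):
    # w <- w + alpha * (target - v_hat(s, w)) * grad_w v_hat;
    # for a linear approximator the gradient is just the feature vector x.
    return w + alpha * (target - v_hat(w, x)) * x

def mc_target(rewards, gamma=0.99):
    # Monte Carlo target: the full return G_t, available only after the episode ends.
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

def td0_target(r_next, x_next, w, gamma=0.99):
    # TD(0) target: one-step bootstrap R_{t+1} + gamma * v_hat(S_{t+1}, w_t).
    return r_next + gamma * v_hat(w, x_next)
```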
An operator is, generally, a mapping from functions to functions.
The Bellman operator $T$ maps (updates) $V_t$ to $V_{t+1}$ by applying the recursive relationship:
\[TV = R + \gamma PV\]
Fortunately, on a complete metric space, repeatedly applying $T$ converges to the unique fixed point at a linear rate of $\gamma$, by the contraction mapping theorem: $\|T(u)-T(v)\|_{\infty} \leq \gamma \|u-v\|_{\infty}$.
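To make the contraction concrete, here is a toy sketch (the 3-state rewards and transition matrix are made up, not from the text) where repeatedly applying $TV = R + \gamma PV$ shrinks the sup-norm distance to the fixed point by roughly a factor of $\gamma$ per iteration:

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0, 2.0])                 # hypothetical reward vector
P = np.array([[0.5, 0.5, 0.0],                # hypothetical transition matrix
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

def T(V):
    # Bellman operator for a fixed policy: T V = R + gamma * P V
    return R + gamma * P @ V

# The unique fixed point solves V = R + gamma * P V, i.e. (I - gamma * P) V = R.
V_star = np.linalg.solve(np.eye(3) - gamma * P, R)

V = np.zeros(3)
for k in range(10):
    V = T(V)
    # sup-norm error decays roughly as gamma**k (contraction at rate gamma)
    print(k, np.max(np.abs(V - V_star)))
```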
Q-learning (model-free, temporal-difference, sample-based):
Temporal difference (TD) – literally, the “temporal” difference between the estimates at $t+1$ and at $t$.
\[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ \textcolor{red}{r_{t+1} + \gamma \max_a Q(s_{t+1}, a)} - Q(s_t, a_t) \right]\]
Analogous to the “target” (label) in supervised learning, the TD “target” serves as a (semi-)supervision signal in Q-learning.
The TD target $r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$ consists of the actual signal (label) $r_{t+1}$ and the estimate $\gamma \max_a Q(s_{t+1}, a)$.
Intuitively, the Q function is updated by the error = (label + estimate) - (current estimate). This is called “bootstrapping”: you estimate based on your own estimate. Cf. the Monte Carlo method, which uses the full return $G_t$ instead.
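A minimal sketch of this update in tabular form (the table shape and step parameters are placeholders, not from the text):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # TD target = actual signal (label) r + bootstrapped estimate gamma * max_a Q
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]            # (label + estimate) - current estimate
    Q[s, a] += alpha * td_error
    return Q

# Usage with a hypothetical 5-state, 2-action table:
Q = np.zeros((5, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=3)
```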
I was once obsessed with reducing the size of $|S \times A|$, specifically $|A|$, but was that the right approach?