Consider a policy improvement step:

\[V_{\pi_{new}} \geq V_{\pi_{old}}\]

Instead of a “direct” policy gradient update, we do approximately optimal policy gradient. Two questions arise:

Does the algorithm guarantee improvement?

By how much (under what measure) will it improve?

The key tool is a local approximation of the true objective.

One can easily come up with a “conservative” update: improve the policy while staying close to the old policy, so that behavior in unlikely (rarely visited) states, which might be globally good, is not thrown away.

One such update is a mixture of the old policy $\pi_{old}$ and a candidate policy $\pi'$:

\[\pi_{new} = (1 - \alpha) \pi_{old} + \alpha \pi'\]

You can rewrite this as:

\[\pi_{new} = \pi_{old} + \alpha (\pi' - \pi_{old})\]

That is, update the old policy toward the candidate policy with step size $\alpha$.
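As a concrete illustration, here is a minimal numpy sketch of this mixture update for a discrete action space (the function name and toy numbers are mine, not from the original sources):

```python
import numpy as np

def mixture_policy(pi_old, pi_prime, alpha):
    """Conservative mixture update for discrete action probabilities.

    pi_new = (1 - alpha) * pi_old + alpha * pi_prime
           = pi_old + alpha * (pi_prime - pi_old)
    """
    pi_old = np.asarray(pi_old, dtype=float)
    pi_prime = np.asarray(pi_prime, dtype=float)
    return pi_old + alpha * (pi_prime - pi_old)

# Example: a small alpha keeps pi_new close to pi_old.
pi_old = np.array([0.7, 0.2, 0.1])
pi_prime = np.array([0.1, 0.1, 0.8])
print(mixture_policy(pi_old, pi_prime, alpha=0.1))
```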

In the end, you want to measure the distance between $\pi_{old}$ and $\pi'$.

For stochastic policies, a natural distance is the total variation distance, which can in turn be bounded by the KL divergence (a Pinsker-type inequality), as used below.
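As a quick numerical illustration (the helper names are mine), one can check that the squared total variation distance is bounded by the KL divergence for discrete distributions:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """KL divergence D_KL(p || q), assuming q > 0 everywhere."""
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))
q = rng.dirichlet(np.ones(4))
# Pinsker-type bound used in these notes: D_TV^2 <= D_KL.
print(tv_distance(p, q) ** 2 <= kl_divergence(p, q))  # True
```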

Importance sampling will also be needed, so that expectations under a candidate policy can be estimated from samples collected with the old policy.

In Kakade and Langford (2002), a lower bound on the performance of this mixture policy is provided.

Instead of the direct policy gradient objective, we maximize this lower bound, also called the surrogate loss. Because the bound touches the true objective at the current policy, improving the surrogate guarantees monotonic improvement of the true return.

Here $\rho^{\pi}(s)$ denotes the discounted state visitation frequency under policy $\pi$.

\[\begin{align} \eta(\tilde{\pi}) &= \eta(\pi) + \sum_{s} \rho^{\textcolor{red}{\tilde{\pi}}}(s) \sum_{a} \tilde{\pi}(a \mid s) A^{\pi}(s, a) \\ L_{\pi}(\tilde{\pi}) &= \eta(\pi) + \sum_{s} \rho^{\textcolor{red}{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A^{\pi}(s, a) \end{align}\]

The only difference between the true objective $\eta(\tilde{\pi})$ and the local approximation $L_{\pi}(\tilde{\pi})$ is which policy's visitation frequency is used (highlighted in red). For the mixture policy with mixing weight $\alpha$, the approximation error is second order, so the true objective is lower-bounded by the surrogate up to an $O(\alpha^2)$ penalty:

\[\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - O(\alpha^{2})\]

To turn this into something measurable between arbitrary stochastic policies, the total variation distance is bounded by the KL divergence:

\[D_{TV}^{2} \leq D_{KL}\]
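For reference, the general (non-mixture) form of this bound used in the TRPO paper (Schulman et al., 2015) replaces the $O(\alpha^2)$ term with a maximum-KL penalty; the statement below is my paraphrase of its Theorem 1:

\[\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C \, D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \quad \epsilon = \max_{s, a} \left| A^{\pi}(s, a) \right|\]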

Importance sampling: the inner sum over actions can be rewritten as an expectation under the old policy, so the surrogate can be estimated from trajectories collected with $\pi_{old}$:

\[\sum_{a} \tilde{\pi}(a \mid s) A^{\pi}(s, a) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ \frac{\tilde{\pi}(a \mid s)}{\pi(a \mid s)} A^{\pi}(s, a) \right]\]

KL: constraining the KL divergence between the old and new policy while maximizing this surrogate gives TRPO.
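Schematically, the practical TRPO update solves a constrained problem of roughly this form (here $\theta$ are the policy parameters and $\delta$ is the trust-region size; this is a paraphrase, not a quote):

\[\max_{\theta} \; \mathbb{E}_{s \sim \rho^{\pi_{old}},\, a \sim \pi_{old}} \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{old}(a \mid s)} A^{\pi_{old}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho^{\pi_{old}}} \left[ D_{KL}\!\left( \pi_{old}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \leq \delta\]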

Clip: clipping the importance ratio instead of constraining the KL gives PPO.
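A minimal numpy sketch of PPO's clipped surrogate objective (function name and toy numbers are mine; a real implementation would compute log-probabilities and advantages from rollouts and maximize this with gradient ascent):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with made-up numbers.
logp_old = np.log(np.array([0.5, 0.2, 0.3]))
logp_new = np.log(np.array([0.6, 0.1, 0.3]))
advantages = np.array([1.0, -0.5, 0.2])
print(ppo_clip_objective(logp_new, logp_old, advantages))
```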


References