[Editing] From TRPO to PPO
Consider a policy improvement step:

\[V_{\pi_{new}} \geq V_{\pi_{old}}\]

Instead of a "direct" policy gradient, we take an approximately optimal policy gradient step. Two questions follow:
Does the algorithm guarantee improvement?
How much (and under what measure) will it improve?
The key object is a local approximation of the true objective, which is only accurate near the current policy.
One can easily come up with a "conservative" update: improve the policy while also staying close to the old policy, so that we do not discard behavior in rarely visited states that might turn out to be globally good.
A natural choice is a mixture of the old policy $\pi_{old}$ and a candidate policy $\pi'$:

\[\pi_{new} = (1 - \alpha) \pi_{old} + \alpha \pi'\]

which you can rewrite as:

\[\pi_{new} = \pi_{old} + \alpha (\pi' - \pi_{old})\]

i.e., the old policy is updated toward $\pi'$ with step size $\alpha$.
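As a sanity check, here is a minimal NumPy sketch of this mixture update for a single state's action distribution (the helper name `mixture_policy` and the numbers are illustrative, not from any particular implementation):

```python
import numpy as np

def mixture_policy(pi_old, pi_prime, alpha):
    """Conservative mixture update: (1 - alpha) * pi_old + alpha * pi_prime.

    pi_old, pi_prime: arrays of shape (num_actions,) holding action
    probabilities for one state; alpha in [0, 1] is the mixing coefficient.
    """
    pi_new = (1.0 - alpha) * pi_old + alpha * pi_prime
    return pi_new / pi_new.sum()  # renormalize against rounding error

# Toy example: 3 actions, small alpha keeps pi_new close to pi_old.
pi_old = np.array([0.7, 0.2, 0.1])
pi_prime = np.array([0.1, 0.1, 0.8])
print(mixture_policy(pi_old, pi_prime, alpha=0.1))
```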
Ultimately, you want to measure the distance between the old policy and the new policy.
For stochastic policies, this distance is first measured in total variation, which is then bounded by the KL divergence ($D_{TV}^2 \leq D_{KL}$), so the KL divergence is what gets used in practice.
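A quick numerical check of this inequality on two arbitrary discrete distributions (a sketch; the distributions are made up):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """KL(p || q) with natural log."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(tv_distance(p, q) ** 2, kl_divergence(p, q))  # prints ~0.04 <= ~0.085
```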
Importance sampling lets us estimate expectations under a new policy using actions sampled from the old policy, by reweighting each sample with the probability ratio $\frac{\pi_{new}(a \mid s)}{\pi_{old}(a \mid s)}$.
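A quick illustration of the idea for a single state with three actions (all numbers are arbitrary): the reweighted estimate under samples from $\pi_{old}$ converges to the exact expectation under $\pi_{new}$.

```python
import numpy as np

rng = np.random.default_rng(0)

pi_old = np.array([0.7, 0.2, 0.1])        # sampling distribution
pi_new = np.array([0.5, 0.3, 0.2])        # target distribution
advantage = np.array([1.0, -0.5, 2.0])    # A(a) for each action (illustrative)

actions = rng.choice(3, size=100_000, p=pi_old)
ratio = pi_new[actions] / pi_old[actions]
print(np.mean(ratio * advantage[actions]))  # importance-sampled estimate
print(np.sum(pi_new * advantage))           # exact value: 0.75
```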
Kakade and Langford (2002) provide a lower bound on the improvement obtained by this mixture policy.
Instead of a direct policy gradient, we maximize this lower bound, also called the surrogate loss; optimizing it gives a monotonic improvement guarantee.
In the expressions below, $\rho^{\pi}(s)$ denotes the (discounted) state visitation frequency under policy $\pi$.
\[\begin{align} \eta(\tilde{\pi}) &= \eta(\pi) + \sum_{s} \rho^{\textcolor{red}{\tilde{\pi}}}(s) \sum_{a} \tilde{\pi}(a \mid s) A^{\pi}(s, a) \\ L_{\pi}(\tilde{\pi}) &= \eta(\pi) + \sum_{s} \rho^{\textcolor{red}{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A^{\pi}(s, a) \end{align}\]

The only difference (in red) is that the surrogate $L_{\pi}$ uses the state visitation frequency of the old policy $\pi$ rather than the new policy $\tilde{\pi}$, which is what makes it computable from data collected with $\pi$. For the mixture update with coefficient $\alpha$, the surrogate matches the true objective up to a second-order error:

\[\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - O(\alpha^2)\]

To go beyond mixture policies, the $\alpha$ in this bound is replaced by a distance between the policies: total variation first, and then KL via

\[D_{TV}^2 \leq D_{KL}\]

Finally, importance sampling rewrites the inner sum over actions as an expectation over actions sampled from the old policy:

\[\sum_{a} \tilde{\pi}(a \mid s) A^{\pi}(s, a) = \mathbb{E}_{a \sim \pi}\!\left[ \frac{\tilde{\pi}(a \mid s)}{\pi(a \mid s)} A^{\pi}(s, a) \right]\]
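In sample-based form, the surrogate is just the mean of ratio times advantage over a logged batch. A minimal sketch, assuming per-action log-probabilities and advantage estimates are already available (the name `surrogate_loss` and the numbers are illustrative):

```python
import numpy as np

def surrogate_loss(new_logp, old_logp, advantages):
    """Importance-sampled surrogate objective (to be maximized).

    new_logp, old_logp: log pi_new(a|s) and log pi_old(a|s) for the actions
    actually taken in the batch; advantages: estimates of A^pi_old(s, a).
    """
    ratio = np.exp(new_logp - old_logp)   # pi_new / pi_old
    return np.mean(ratio * advantages)    # sample estimate of L_pi(pi_new)

# Toy batch of 4 logged transitions (illustrative numbers).
old_logp = np.log(np.array([0.5, 0.2, 0.3, 0.4]))
new_logp = np.log(np.array([0.6, 0.15, 0.35, 0.4]))
advantages = np.array([1.0, -0.5, 0.3, 0.2])
print(surrogate_loss(new_logp, old_logp, advantages))
```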
TRPO keeps the surrogate but enforces closeness as a constraint: maximize $L_{\pi_{old}}(\pi)$ subject to a trust-region constraint on the mean KL divergence, $\bar{D}_{KL}(\pi_{old}, \pi) \leq \delta$, solved in the original paper with conjugate gradients and a line search.
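The full TRPO update (Fisher-vector products, conjugate gradient, line search) is too long to sketch here; as a lighter illustration of the same trust-region idea, here is the KL-penalized form of the objective (the function name and all numbers are illustrative, not the actual TRPO algorithm):

```python
import numpy as np

def kl_penalized_surrogate(new_probs, old_probs, actions, advantages, beta=1.0):
    """Penalty form of the trust-region objective (to be maximized):
        L(pi_new) - beta * mean KL(pi_old || pi_new).

    new_probs, old_probs: (batch, num_actions) action probabilities per state;
    actions: indices of the actions actually taken; advantages: A^pi_old(s, a).
    """
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    surrogate = np.mean(ratio * advantages)
    kl = np.mean(np.sum(old_probs * np.log(old_probs / new_probs), axis=1))
    return surrogate - beta * kl

# Toy batch: 2 states, 3 actions (illustrative numbers).
old_probs = np.array([[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]])
new_probs = np.array([[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]])
actions = np.array([0, 1])
advantages = np.array([1.0, -0.5])
print(kl_penalized_surrogate(new_probs, old_probs, actions, advantages, beta=0.5))
```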
PPO replaces the KL machinery with a clipped surrogate: the probability ratio $r(\theta) = \frac{\pi_{\theta}(a \mid s)}{\pi_{old}(a \mid s)}$ is clipped to $[1 - \epsilon, 1 + \epsilon]$, which removes the incentive to move the ratio far from 1:

\[L^{CLIP}(\theta) = \mathbb{E}\!\left[ \min\!\left( r(\theta) \hat{A}, \; \mathrm{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A} \right) \right]\]
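A minimal NumPy sketch of the clipped objective on the same kind of toy batch as above (a real implementation would compute this on autodiff tensors and take gradient ascent steps on it):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, epsilon=0.2):
    """PPO clipped surrogate objective (to be maximized)."""
    ratio = np.exp(new_logp - old_logp)                          # r(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))               # pessimistic bound

# Toy batch with one large policy change (illustrative numbers).
old_logp = np.log(np.array([0.5, 0.2, 0.3, 0.4]))
new_logp = np.log(np.array([0.9, 0.05, 0.35, 0.4]))
advantages = np.array([1.0, -0.5, 0.3, 0.2])
print(ppo_clip_objective(new_logp, old_logp, advantages))
```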
References

- Kakade, S. and Langford, J. "Approximately Optimal Approximate Reinforcement Learning." ICML, 2002.
- Schulman, J. et al. "Trust Region Policy Optimization." ICML, 2015.
- Schulman, J. et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017.