Advantage Function
$A(s, a) = Q(s, a) - V(s)$. Intuitively: the extra reward we get if we take action $a$ at state $s$, compared to the mean reward at that state. We use this advantage function to tell us how good the action is: if it's positive, the action is better than others at that state, so we want to move in that direction; if it's negative, the action is worse than others at that state, so we move in the opposite direction.
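As a small worked example of the definition, the sketch below computes $A(s, a)$ for every action at a single state. It assumes that $V(s)$ is the policy-weighted mean of $Q(s, a)$; the function name `advantages` and the numbers are made up for illustration.

```python
def advantages(q_values, policy_probs):
    """Return A(s, a) = Q(s, a) - V(s) for each action at one state.

    Assumption: V(s) = sum_a pi(a|s) * Q(s, a), i.e. the mean
    reward at the state under the current policy.
    """
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - v for q in q_values]


# Hypothetical Q-values and action probabilities for three actions.
q = [1.0, 2.0, 4.0]
pi = [0.5, 0.25, 0.25]

adv = advantages(q, pi)  # V(s) = 2.0, so adv = [-1.0, 0.0, 2.0]
```

Here the third action has a positive advantage (better than average at this state), the first has a negative one, and the second is exactly average.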
Motivation
Intuition: we want to avoid policy updates that are too large
- Smaller policy updates are more likely to converge to the optimal policy
- Falling "off the cliff" might make it impossible to recover
How we solve this: measure how much the policy changes w.r.t. the previous one (as a ratio of action probabilities), and clip that ratio to $[1-\varepsilon, 1+\varepsilon]$, removing the incentive to go too far.
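The clipping step can be sketched as follows. This is a minimal illustration, not a full training loop. It assumes the ratio is $\pi_{\text{new}}(a \mid s) / \pi_{\text{old}}(a \mid s)$ and that the clipped ratio is combined with the advantage by taking the minimum of the clipped and unclipped terms, as in the usual PPO-style surrogate objective. The function names and the default `eps=0.2` are illustrative choices, not taken from these notes.

```python
def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(x, high))


def clipped_objective(p_new, p_old, advantage, eps=0.2):
    """Per-sample clipped surrogate objective (assumed PPO-style form).

    p_new, p_old: probability of the taken action under the new and
    old policy. advantage: A(s, a) for that action.
    """
    ratio = p_new / p_old
    clipped_ratio = clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum means the objective never rewards pushing the
    # ratio further outside [1 - eps, 1 + eps] in the helpful direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio is 2.0: with a positive advantage the gain is capped at 1 + eps.
capped_gain = clipped_objective(0.5, 0.25, advantage=1.0)

# Ratio is 0.5: with a negative advantage the value is held at
# (1 - eps) * A, so there is no extra credit for shrinking it further.
capped_loss = clipped_objective(0.25, 0.5, advantage=-1.0)
```

With a positive advantage, raising the action's probability beyond a ratio of $1 + \varepsilon$ no longer increases the objective; with a negative advantage, lowering it below $1 - \varepsilon$ no longer helps either. That is what removes the incentive to take an overly large step.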