Allen's REINFORCE notes

1,760 bytes added, 01:23, 26 May 2024

Suppose we'd like to find <math>\nabla_{x_1}\log(f(x_1, x_2, x_3, \ldots))</math>. By the chain rule this is equal to <math>\frac{\nabla_{x_1}f(x_1, x_2, x_3, \ldots)}{f(x_1, x_2, x_3, \ldots)}</math>. Thus, by rearranging, we can write the gradient of any positive function with respect to some variable as <math>\nabla_{x_1}f(x_1, x_2, x_3, \ldots)= f(x_1, x_2, x_3, \ldots)\nabla_{x_1}\log(f(x_1, x_2, x_3, \ldots))</math>.
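The identity above can be checked numerically. The following is a minimal sketch, assuming a hypothetical positive test function <code>f</code> of two variables and a central finite difference for the partial derivative; neither comes from the original notes.

```python
import math

def f(x1, x2):
    # Hypothetical positive test function; any differentiable f > 0 would do.
    return math.exp(-x1 ** 2) * (1.0 + x2 ** 2)

def log_f(x1, x2):
    return math.log(f(x1, x2))

def partial_x1(g, x1, x2, h=1e-6):
    # Central finite-difference approximation of dg/dx1.
    return (g(x1 + h, x2) - g(x1 - h, x2)) / (2.0 * h)

x1, x2 = 0.7, -1.3

# Left-hand side: gradient of f taken directly.
direct = partial_x1(f, x1, x2)

# Right-hand side: f times the gradient of log f.
via_log = f(x1, x2) * partial_x1(log_f, x1, x2)

print(direct, via_log)
```

Both values should agree up to finite-difference error, and both should match the analytic derivative <math>-2 x_1 f(x_1, x_2)</math> of this particular test function.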

Thus, using this idea, we can rewrite our gradient as <math> \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) </math>.
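To see that this rewritten sum really is the gradient of the expected reward, here is a small sketch on a toy problem. Everything in it is an assumption made for illustration: there are only two possible trajectories, a single scalar parameter <code>theta</code>, and the second trajectory occurs with probability <code>sigmoid(theta)</code>.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def traj_probs(theta):
    # Hypothetical setup: trajectory 1 has probability sigmoid(theta), trajectory 0 the rest.
    p1 = sigmoid(theta)
    return [1.0 - p1, p1]

def grad_log_probs(theta):
    # d/dtheta log(1 - sigmoid(theta)) = -sigmoid(theta)
    # d/dtheta log(sigmoid(theta))     = 1 - sigmoid(theta)
    p1 = sigmoid(theta)
    return [-p1, 1.0 - p1]

rewards = [1.0, 5.0]  # R(tau) for each of the two trajectories
theta = 0.3

def expected_reward(theta):
    # J(theta) = sum over tau of R(tau) * P(tau | theta)
    return sum(r * p for r, p in zip(rewards, traj_probs(theta)))

# Gradient of J by central finite differences.
h = 1e-6
grad_direct = (expected_reward(theta + h) - expected_reward(theta - h)) / (2.0 * h)

# Gradient of J via sum over tau of R(tau) * P(tau | theta) * grad log P(tau | theta).
grad_score = sum(
    r * p * g
    for r, p, g in zip(rewards, traj_probs(theta), grad_log_probs(theta))
)

print(grad_direct, grad_score)
```

The two numbers should agree up to finite-difference error. In practice the sum over all trajectories is not available, which is why the later steps turn it into an expectation that can be estimated from sampled episodes.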

=== Loss Computation ===

It is tricky for us to give our policy the notion of "total" reward and "total" probability. Thus, we want to change these quantities, which are parameterized by <math> \tau </math>, to instead be parameterized by <math> t </math>. That is, instead of examining the behavior of the entire episode, we want to create a summation over timesteps. We know that <math> R(\tau) </math> is the total reward over all timesteps. Thus, we can rewrite the <math> R(\tau) </math> component at some timestep <math> t </math> as <math> \gamma^{T - t}r_t </math>, where <math> \gamma </math> is our discount factor.

Further, we recall that the probability of the trajectory occurring given the policy is <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>. Since <math> P(s_0) </math> and <math> P(s_{t+1} | s_t, a_t) </math> are determined by the environment and independent of the policy, their gradient with respect to <math> \theta </math> is zero. Recognizing this, and further recognizing that the logarithm of a product of probabilities is equal to the sum of the logarithms of each of the probabilities, we get the gradient expression <math> \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) </math>. Rewriting this as an expectation, we have <math> \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] </math>.

Using the formula for discounted reward, we have our final formula <math> E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] </math>. This is why our loss is equal to <math> -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t </math>: taking its derivative with respect to <math> \theta </math> gives the negative of this gradient, so minimizing the loss in our backwards pass increases <math> J(\theta) </math> (see Dennis' Optimization Notes).
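As a rough illustration, the loss can be computed for one episode as follows. This is a minimal sketch in plain Python that follows the <math> \gamma^{T - t}r_t </math> weighting exactly as written in these notes. The function name <code>reinforce_loss</code>, the three-step episode, and the action probabilities are all hypothetical. In an actual training loop the log-probabilities would presumably come from an automatic-differentiation framework so that the loss can be backpropagated.

```python
import math

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Return -sum_t log pi(a_t | s_t) * gamma^(T - t) * r_t for one episode.

    log_probs[t] is the log-probability the policy gave to the action taken at
    step t, and rewards[t] is the reward received at step t.
    """
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    T = len(rewards) - 1  # index of the final timestep
    return -sum(
        lp * (gamma ** (T - t)) * r
        for t, (lp, r) in enumerate(zip(log_probs, rewards))
    )

# Hypothetical 3-step episode: probabilities the policy assigned to its chosen actions.
action_probs = [0.5, 0.25, 0.8]
log_probs = [math.log(p) for p in action_probs]
rewards = [1.0, 0.0, 2.0]

loss = reinforce_loss(log_probs, rewards, gamma=0.9)
print(loss)
```

Steps with zero reward contribute nothing to the loss. Steps with positive reward contribute a positive term, because the log-probability is negative, so lowering the loss raises the probability of the actions taken at those steps.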

Allen12
