Allen's REINFORCE notes

=== Links ===

* [http://www.incompleteideas.net/book/RLbook2020.pdf RLbook2020]
* [https://samuelebolotta.medium.com/2-deep-reinforcement-learning-policy-gradients-5a416a99700a Deep RL: Policy Gradients]
  
 
[[Category:Reinforcement Learning]]
=== Motivation ===
  
Recall that the objective of Reinforcement Learning is to find an optimal policy <math> \pi^* </math>, which we encode in a neural network with parameters <math>\theta^*</math>. The policy <math> \pi_\theta </math> is a mapping from observations to actions. The optimal parameters are defined as

<math>\theta^* = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right] </math>.

Let's unpack what this means: the optimal policy is the one for which the expected total reward along a trajectory <math> \tau </math> generated by following the policy is highest over all policies.
  
=== Overview ===
  
Learning involves the agent taking actions and the environment returning a new state and a reward. The setup is:

* Input: <math>s_t</math>, the state at each time step
* Output: <math>a_t</math>, the action at each time step
* Data: <math>(s_1, a_1, r_1, \ldots , s_T, a_T, r_T)</math>
* Goal: learn <math>\pi_\theta : s_t \to a_t </math> to maximize <math> \sum_t r_t </math>

At a high level, training proceeds as follows:

<syntaxhighlight lang="text">
Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For each episode:
  While not terminated:
    Get observation from environment
    Use policy network to map observation to action distribution
    Randomly sample one action from the action distribution
    Compute the log-probability of that action
    Step the environment with the action and store the reward
  Calculate the loss over the entire trajectory as a function of the log-probabilities and rewards
  Recall that the loss is differentiable with respect to each parameter - thus, compute how changes in the parameters change the loss
  Based on the loss, take a gradient descent step to update the weights
</syntaxhighlight>
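Below is a minimal end-to-end sketch of this loop in PyTorch, using gymnasium's CartPole-v1 as a stand-in environment. The two-layer network, learning rate, and episode count are illustrative assumptions rather than part of these notes, and the loss line uses the discounted REINFORCE loss derived in the Loss Computation section below.

<syntaxhighlight lang="python">
# REINFORCE training loop sketch. Assumes the gymnasium and torch packages;
# the network size, learning rate, and episode count are arbitrary choices.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]   # input dims = observation dims
act_dim = env.action_space.n               # output dims = action dims

# Policy network: maps an observation to the logits of a categorical action distribution.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    terminated = truncated = False
    while not (terminated or truncated):
        # Map the observation to an action distribution and sample one action.
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))  # log-probability of that action
        # Step the environment with the sampled action and store the reward.
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)

    # Loss over the whole trajectory: -sum_t log pi(a_t|s_t) * gamma^(T-t) * r_t
    # (the discounted REINFORCE loss derived in the Loss Computation section below).
    T = len(rewards)
    returns = torch.tensor([gamma ** (T - t) * r for t, r in enumerate(rewards)])
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()    # gradients of the loss w.r.t. every policy parameter
    optimizer.step()   # gradient descent step on the weights
</syntaxhighlight>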
=== Objective Function ===
The goal of reinforcement learning is to maximize the expected reward over the entire episode. We use <math>R(\tau)</math> to denote the total reward over some trajectory <math>\tau</math> defined by our policy. Thus we want to maximize <math>E_{\tau \sim \pi_\theta}[R(\tau)]</math>. We can use the definition of expected value to expand this as <math>\sum_\tau P(\tau | \theta) R (\tau)</math>, where the probability of a given trajectory occurring can further be expressed as <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>.
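This expectation is our objective function, usually written <math> J(\theta) </math>:

<math> J(\theta) = E_{\tau \sim \pi_\theta}\left[ R(\tau) \right] = \sum_\tau P(\tau | \theta) R(\tau) </math>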
Now we want to find the gradient of <math> J (\theta) </math>, namely
<math>\nabla_\theta \sum_\tau P(\tau | \theta) R(\tau) </math>. Since the reward function does not depend on the parameters, we can move the gradient inside the sum and rearrange: <math> \sum_\tau R(\tau) \nabla_\theta  P(\tau | \theta) </math>. The next step here is what's called the log-derivative trick.
Suppose we'd like to find <math>\nabla_{x_1}\log(f(x_1, x_2, x_3, \ldots))</math>. By the chain rule this is equal to <math>\frac{\nabla_{x_1}f(x_1, x_2, x_3, \ldots)}{f(x_1, x_2, x_3, \ldots)}</math>. Thus, by rearranging, we can take the gradient of any function with respect to some variable as <math>\nabla_{x_1}f(x_1, x_2, x_3, \ldots)= f(x_1, x_2, x_3, \ldots)\nabla_{x_1}\log(f(x_1, x_2, x_3, \ldots))</math>.
Thus, using this idea, we can rewrite our gradient as <math> \sum_\tau R(\tau) P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) </math>.
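Putting these steps together, the gradient of the objective becomes an expectation we can estimate from sampled trajectories:

<math> \nabla_\theta J(\theta) = \sum_\tau R(\tau) \nabla_\theta P(\tau | \theta) = \sum_\tau P(\tau | \theta) R(\tau) \nabla_\theta \log P(\tau | \theta) = E_{\tau \sim \pi_\theta}\left[ R(\tau) \nabla_\theta \log P(\tau | \theta) \right] </math>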
=== Loss Computation ===
It is awkward to work with the "total" reward and "total" probability of a whole trajectory directly. Instead, we want to rewrite these quantities, currently indexed by the trajectory <math> \tau </math>, as sums over timesteps <math>t</math>. We know that <math> R(\tau) </math> is the total reward over all timesteps, so we can write its contribution at timestep <math>t</math> as the discounted reward <math> \gamma^{T - t}r_t </math>, where <math> \gamma </math> is the discount factor. Further, recall that the probability of a trajectory under the policy is <math> P(\tau | \theta) = P(s_0) \prod^T_{t=0} \pi_\theta(a_t | s_t) P(s_{t + 1} | s_t, a_t) </math>. Since <math> P(s_0) </math> and <math> P(s_{t+1} | s_t, a_t) </math> are determined by the environment and do not depend on the policy parameters, their gradients with respect to <math>\theta</math> are zero. Using this, together with the fact that the log of a product of probabilities is the sum of their logs, we get our final gradient expression <math> \sum_\tau P(\tau | \theta) R( \tau) \sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) </math>.
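Spelled out, the gradient of the trajectory log-probability keeps only the policy terms:

<math> \nabla_\theta \log P(\tau | \theta) = \nabla_\theta \left[ \log P(s_0) + \sum_{t=0}^T \log \pi_\theta(a_t | s_t) + \sum_{t=0}^T \log P(s_{t+1} | s_t, a_t) \right] = \sum_{t=0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) </math>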
Rewriting this into an expectation, we have <math> \nabla_\theta J (\theta) = E_{\tau \sim \pi_\theta}\left[R(\tau)\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t)\right] </math>. Substituting the per-timestep discounted reward, we have our final formula <math> \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t \right] </math>. This is why our loss is <math> -\sum_{t = 0}^T \log \pi_\theta (a_t | s_t) \gamma^{T - t}r_t </math>: taking its gradient via the chain rule in the backward pass gives the negative of the policy-gradient term above, which is exactly what gradient descent needs (see Dennis' Optimization Notes).
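As a concrete sketch of that loss in PyTorch (the function name and tensor layout are illustrative: log_probs holds the per-step <math>\log \pi_\theta (a_t | s_t)</math> values with gradients attached, and rewards holds the per-step <math>r_t</math>), this is the same quantity the Overview sketch minimizes:

<syntaxhighlight lang="python">
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """Loss = -sum_t log pi_theta(a_t | s_t) * gamma^(T - t) * r_t."""
    T = rewards.shape[0]
    # Discount each per-step reward by gamma^(T - t), as in the derivation above.
    discounts = torch.tensor([gamma ** (T - t) for t in range(T)], dtype=rewards.dtype)
    return -(log_probs * discounts * rewards).sum()
</syntaxhighlight>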
