Allen's REINFORCE notes
Links
Motivation
Recall that the objective of Reinforcement Learning is to find an optimal policy, which we encode in a neural network with parameters <math>\theta^*</math>. The policy <math>\pi_\theta</math> is a mapping from observations to actions. The optimal parameters are defined as <math>\theta^* = \text{argmax}_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]</math>. Let's unpack what this means. In plain English, the optimal policy is the one whose parameters maximize the expected total reward collected along a trajectory <math>\tau</math> sampled by following that policy.
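To make the expectation concrete, it can be estimated by Monte Carlo: roll out trajectories by following a fixed policy and average the total rewards. The sketch below is an assumption about the setting rather than code from these notes; it presumes a Gymnasium-style environment and a `policy` callable (both placeholder names) that maps an observation to an action.

<syntaxhighlight lang="python">
import numpy as np

def estimate_objective(policy, env, num_episodes=100):
    """Monte Carlo estimate of E_{tau ~ p_theta(tau)} [ sum_t r(s_t, a_t) ].

    `policy` and `env` are placeholders: `policy` maps an observation to an
    action, and `env` follows the Gymnasium reset()/step() API.
    """
    returns = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        total_reward, done = 0.0, False
        while not done:
            action = policy(obs)                       # a_t ~ pi_theta(a | s_t)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward                     # accumulate r(s_t, a_t)
            done = terminated or truncated
        returns.append(total_reward)
    return np.mean(returns)                            # sample mean approximates the expectation
</syntaxhighlight>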
Overview
Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For each episode:
    While not terminated:
        Get observation from environment
        Use policy network to map observation to action distribution
        Randomly sample one action from action distribution
        Compute logarithmic probability of that action occurring
        Step environment using action and store reward
    Calculate loss over the entire trajectory as a function of the stored probabilities and rewards
    Recall that the loss is differentiable with respect to each parameter, so compute how changes in the parameters change the loss (the gradient)
    Use a gradient descent step on that loss to update the weights
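The pseudocode above maps fairly directly onto a short PyTorch implementation. The sketch below is a minimal interpretation, not the notes' reference code: it assumes Gymnasium's CartPole-v1, a discrete action space, and the simplest REINFORCE-style loss (negative sum of log-probabilities scaled by the total episode reward) as a stand-in for whatever the Loss Function section defines.

<syntaxhighlight lang="python">
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]   # input dimensions = observation dimensions
act_dim = env.action_space.n               # output dimensions = action dimensions

# Initialize neural network: observations in, action logits out
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

for episode in range(500):                 # For each episode:
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                        # While not terminated:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy_net(obs_t))  # action distribution
        action = dist.sample()                                            # randomly sample one action
        log_probs.append(dist.log_prob(action))                           # log probability of that action
        obs, reward, terminated, truncated, _ = env.step(action.item())   # step environment
        rewards.append(reward)                                            # store reward
        done = terminated or truncated

    # Loss over the entire trajectory: higher-reward episodes push their
    # actions' log-probabilities up more strongly
    loss = -torch.stack(log_probs).sum() * sum(rewards)

    optimizer.zero_grad()
    loss.backward()                        # gradient of the loss w.r.t. each parameter
    optimizer.step()                       # gradient descent update of the weights
</syntaxhighlight>

In practice the reward factor is usually the reward-to-go from each timestep rather than the whole-episode sum, but that refinement belongs to the loss-function discussion.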