Allen's REINFORCE notes
Links
Motivation
Recall that the objective of Reinforcement Learning is to find an optimal policy <math>\pi^*</math>, which we encode in a neural network with parameters <math>\theta^*</math>. <math>\pi_\theta</math> is a mapping from observations to actions. The optimal parameters are defined as

<math>\theta^* = \text{argmax}_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right].</math>

Let's unpack what this means. In plain English: the optimal policy is the one under which the expected total reward, taken over trajectories <math>\tau</math> generated by following that policy, is the highest over all policies.
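For reference, under the standard Markov decision process assumptions the trajectory distribution in that expectation factorizes as

<math>p_\theta(\tau) = p(s_1) \prod_t \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t),</math>

so the objective is simply the total reward averaged over rollouts obtained by following the policy.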
Overview
<syntaxhighlight lang="text">
Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For each episode:
    While not terminated:
        Get observation from environment
        Use policy network to map observation to action distribution
        Randomly sample one action from action distribution
        Compute logarithmic probability of that action occurring
        Step environment using action and store reward
    Calculate loss over entire trajectory as function of probabilities and rewards
    Recall that the loss is differentiable with respect to each parameter - thus, compute the gradient of the loss with respect to the parameters
    Based on this gradient, take a gradient descent step to update the weights
</syntaxhighlight>
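The following is a minimal, self-contained sketch of this loop, assuming PyTorch and Gymnasium with a discrete-action environment (CartPole-v1 here). The network architecture, learning rate, episode count, and the use of the undiscounted total return as the weight on the log-probabilities are illustrative choices, not prescriptions from these notes.

<syntaxhighlight lang="python">
# Minimal REINFORCE sketch (assumes PyTorch and Gymnasium are installed).
# The hyperparameters below are illustrative, not part of the original notes.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]  # input dimensions = observation dimensions
act_dim = env.action_space.n              # output dimensions = action dimensions

# Policy network: maps an observation to logits over actions.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    terminated = truncated = False
    while not (terminated or truncated):
        # Map observation -> action distribution, sample an action,
        # and store the log-probability of that action.
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))

        # Step the environment using the sampled action and store the reward.
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)

    # Loss over the entire trajectory: negative sum of log-probabilities,
    # weighted by the (undiscounted) total return of the episode.
    total_return = sum(rewards)
    loss = -torch.stack(log_probs).sum() * total_return

    optimizer.zero_grad()
    loss.backward()   # gradient of the loss with respect to each parameter
    optimizer.step()  # gradient descent step to update the weights
</syntaxhighlight>

Gradient descent on this loss is gradient ascent on expected total reward: it raises the log-probability of the sampled actions in proportion to the return they produced.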