Allen's REINFORCE notes
Motivation
Recall that the objective of reinforcement learning is to find an optimal policy, which we encode in a neural network with parameters $\theta^*$. The policy $\pi_\theta$ is a mapping from observations to actions. The optimal parameters are defined as

$\theta^* = \text{argmax}_\theta \; E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$

Let's unpack what this means. In plain English: the optimal policy is the one for which the expected total reward along a trajectory $\tau$ drawn from the distribution $p_\theta(\tau)$ induced by the policy is highest over all policies.
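To make the expectation concrete, here is a notation sketch under the standard assumptions REINFORCE uses (a finite-horizon MDP with initial-state distribution $p(s_1)$ and dynamics $p(s_{t+1} \mid s_t, a_t)$; these are assumptions not spelled out above), together with the standard score-function (likelihood-ratio) identity that REINFORCE estimates by sampling trajectories:

$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$

$J(\theta) = E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right], \qquad \nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_t r(s_t, a_t) \right) \right]$

This gradient is what the loss in the overview below is built to produce.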
Overview
1 Initialize a neural network with input dimensions = observation dimensions and output dimensions = action dimensions
2 For each episode:
3     While not terminated:
4         Get an observation from the environment
5         Use the policy network to map the observation to an action distribution
6         Randomly sample one action from the action distribution
7         Compute the log-probability of the sampled action
8         Step the environment using the action and store the reward (steps 3-8 are sketched in the rollout example after this list)
9     Calculate the loss over the entire trajectory as a function of the stored log-probabilities and rewards
10     Recall that the loss is differentiable with respect to each parameter, so compute the gradient of the loss with respect to the parameters, i.e. how the loss changes as each parameter changes
11     Based on that gradient, take a gradient descent step on the loss (equivalently, gradient ascent on expected reward) to update the weights (steps 9-11 are sketched in the update example after this list)
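A minimal rollout sketch of steps 1 through 8, assuming PyTorch, the Gymnasium package, and its CartPole-v1 environment (the environment choice, network sizes, and names such as run_episode are illustrative assumptions, not something these notes prescribe):

import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]   # input dims = observation dims
act_dim = env.action_space.n               # output dims = action dims

# Step 1: a small policy network mapping observations to action logits.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, act_dim),  # raw logits; Categorical applies the softmax
)

def run_episode():
    # Steps 3-8: roll out one episode, storing log-probs and rewards.
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        # Step 5: map the observation to an action distribution.
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        # Step 6: randomly sample one action from that distribution.
        action = dist.sample()
        # Step 7: log-probability of the sampled action.
        log_probs.append(dist.log_prob(action))
        # Step 8: step the environment and store the reward.
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated
    return log_probs, rewards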
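And an update sketch of steps 9 through 11 under the same assumptions. The notes only say the loss is "a function of probabilities and rewards"; the concrete choice below (negative log-probabilities weighted by discounted reward-to-go returns) is one common REINFORCE variant, and the discount factor gamma is an added assumption:

import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    # Step 9: turn per-step rewards into returns G_t (discounted reward-to-go).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Loss over the whole trajectory: -sum_t log pi_theta(a_t | s_t) * G_t.
    loss = -(torch.stack(log_probs) * returns).sum()

    # Steps 10-11: autograd computes d(loss)/d(theta), and the optimizer
    # takes a gradient descent step on the loss (gradient ascent on reward).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage together with the rollout sketch above:
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
# for episode in range(500):
#     log_probs, rewards = run_episode()
#     reinforce_update(log_probs, rewards, optimizer)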