Allen's REINFORCE notes

Recall that the objective of reinforcement learning is to find an optimal policy $\pi^{*}$, which we encode in a neural network with parameters $\theta ^{*}$. These optimal parameters are defined as $\theta^{*} = \text{argmax}_\theta E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$. Let's unpack what this means. In plain English, the optimal policy is the one whose expected total reward, taken over the trajectories ($\tau$) produced by following that policy, is the highest among all policies.
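To make the expectation concrete, here is a minimal sketch that estimates $E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$ by averaging the total reward over sampled trajectories. Everything in it is an illustrative assumption rather than part of these notes: the toy environment (reward equals the action, fixed horizon), the one-parameter sigmoid policy, and the names `HORIZON`, `policy_prob`, `sample_return`, and `estimate_objective`.

```python
import math
import random

HORIZON = 5  # hypothetical episode length for the toy problem


def policy_prob(theta):
    """Probability of picking action 1 under a one-parameter sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta))


def sample_return(theta, rng):
    """Roll out one trajectory tau ~ p_theta(tau) and return sum_t r(s_t, a_t)."""
    total = 0.0
    for _t in range(HORIZON):
        action = 1 if rng.random() < policy_prob(theta) else 0
        total += float(action)  # toy reward (assumption): r(s_t, a_t) = a_t
    return total


def estimate_objective(theta, num_trajectories=5000, seed=0):
    """Monte Carlo estimate of E_{tau ~ p_theta}[sum_t r(s_t, a_t)]."""
    rng = random.Random(seed)
    returns = [sample_return(theta, rng) for _ in range(num_trajectories)]
    return sum(returns) / len(returns)
```

In this toy setting the true objective is `HORIZON * policy_prob(theta)`, so calling `estimate_objective` with a larger `theta` should give a larger estimate. The argmax in the definition of $\theta^{*}$ amounts to searching for the parameter value that makes this quantity as large as possible.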

Overview

Initialize a neural network with input dimensions = observation dimensions and output dimensions = action dimensions. Remember that a policy is a mapping from observations to actions. If the action space is continuous, it may make more sense for the network to output one mean and one standard deviation for each component of the action.
Repeat:
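The initialization step above, in its continuous-action form, could look roughly like the sketch below. It only covers initialization and action sampling, not the repeated training steps. The class name `GaussianPolicy`, the single hidden layer, the `tanh` activation, the weight scale, and the choice to have the network output a log standard deviation (exponentiated to keep it positive) are all assumptions made for illustration.

```python
import math
import random


class GaussianPolicy:
    """Tiny one-hidden-layer network: observation -> (mean, std) per action component."""

    def __init__(self, obs_dim, act_dim, hidden_dim=16, seed=0):
        self.rng = random.Random(seed)
        self.act_dim = act_dim
        # Input width = observation dimensions.
        self.w1 = [[self.rng.gauss(0.0, 0.1) for _ in range(obs_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        # Output width = 2 * action dimensions: one mean and one log-std each.
        self.w2 = [[self.rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                   for _ in range(2 * act_dim)]
        self.b2 = [0.0] * (2 * act_dim)

    def forward(self, obs):
        """Return (means, stds), each a list with one entry per action component."""
        hidden = [math.tanh(sum(w * x for w, x in zip(row, obs)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = [sum(w * h for w, h in zip(row, hidden)) + b
               for row, b in zip(self.w2, self.b2)]
        means = out[:self.act_dim]
        # Exponentiate the log-std outputs so every standard deviation is positive.
        stds = [math.exp(v) for v in out[self.act_dim:]]
        return means, stds

    def act(self, obs):
        """Sample an action: one Gaussian draw per action component."""
        means, stds = self.forward(obs)
        return [self.rng.gauss(m, s) for m, s in zip(means, stds)]
```

For example, `GaussianPolicy(obs_dim=4, act_dim=2).act([0.1, -0.2, 0.3, 0.0])` returns a list of two sampled action components. For a discrete action space, the output layer would instead produce one score per action.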

State vs. Observation
