

Allen's REINFORCE notes





Motivation

Recall that the objective of Reinforcement Learning is to find an optimal policy $\pi_{\theta^*}$, which we encode in a neural network with parameters $\theta^*$. The policy $\pi_\theta$ is a mapping from observations to actions. The optimal parameters are defined as

$\theta^* = \text{argmax}_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]$

Let's unpack what this means. In plain English: the optimal policy is the one under which the expected total reward along a trajectory $\tau$ sampled from the policy's trajectory distribution is the highest over all policies.
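To make the expectation concrete, it helps to write out the trajectory distribution $p_\theta(\tau)$. This factorization is standard: the initial-state distribution $p(s_1)$ and the dynamics $p(s_{t+1} \mid s_t, a_t)$ belong to the environment, and only the per-step policy terms depend on $\theta$:

$p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$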

Overview

Initialize neural network with input dimensions = observation dimensions and output dimensions = action dimensions
For # of episodes:
  While not terminated:
    Get observation from environment
    Use policy network to map observation to action distribution
    Randomly sample one action from action distribution
    Compute logarithmic probability of that action occurring
    Step environment using action and store reward
  Calculate loss over entire trajectory as a function of the stored log probabilities and rewards
  Backpropagate the loss and update the network parameters
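Here is a minimal sketch of that loop, assuming a Gymnasium environment (CartPole is used purely as an example) and a small PyTorch policy network. The architecture, learning rate, and episode count are illustrative choices, not part of the original notes:

  import torch
  import torch.nn as nn
  import gymnasium as gym

  env = gym.make("CartPole-v1")
  obs_dim = env.observation_space.shape[0]  # observation dimensions
  act_dim = env.action_space.n              # action dimensions

  # Policy network: observation in, action logits out
  policy = nn.Sequential(
      nn.Linear(obs_dim, 64),
      nn.Tanh(),
      nn.Linear(64, act_dim),
  )
  optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

  for episode in range(500):
      obs, _ = env.reset()
      log_probs, rewards = [], []
      terminated = truncated = False
      while not (terminated or truncated):
          # Map observation to an action distribution
          logits = policy(torch.as_tensor(obs, dtype=torch.float32))
          dist = torch.distributions.Categorical(logits=logits)
          # Randomly sample one action and record its log probability
          action = dist.sample()
          log_probs.append(dist.log_prob(action))
          # Step environment using action and store reward
          obs, reward, terminated, truncated, _ = env.step(action.item())
          rewards.append(reward)
      # Loss over the entire trajectory: negative sum of log probabilities
      # weighted by the total return, so minimizing the loss with gradient
      # descent performs gradient ascent on expected reward
      total_return = sum(rewards)
      loss = -torch.stack(log_probs).sum() * total_return
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

This is the simplest variant, weighting every action in the episode by the full return; common refinements such as reward-to-go and baselines reduce the variance of the gradient estimate without changing its expectation.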

Loss Function