Allen's Reinforcement Learning Notes

Consider a problem where we have to train a robot to pick up some object. A traditional ML algorithm might try to learn some function f(x) = y, where given some position x observed via the camera we output some behavior y. The trouble is that in the real world, the correct grasp location is a function of both the object and the physical environment, which is hard to ascertain intuitively from observation alone.
The motivation behind reinforcement learning is to repeatedly take observations, then sample the effects of actions on those observations (reward and new observation/state). Ultimately, we hope to create a policy <math>\pi</math> that maps states or observations to optimal actions.
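To make this concrete, below is a minimal sketch of the observe-act-reward loop in Python. The <code>GraspEnv</code> environment, its noisy camera observation, and its distance-based reward are hypothetical stand-ins invented for illustration, not part of these notes.

<syntaxhighlight lang="python">
import random

class GraspEnv:
    """Hypothetical toy environment: move a gripper toward an unknown target."""
    def __init__(self):
        self.target = random.uniform(-1.0, 1.0)  # true grasp location (hidden state)

    def observe(self):
        # Noisy camera reading of the target position (an observation, not the state).
        return self.target + random.gauss(0.0, 0.1)

    def step(self, action):
        # Reward is higher the closer the gripper lands to the target.
        reward = -abs(action - self.target)
        return reward, self.observe()

def policy(observation):
    # Hand-coded placeholder policy: move to wherever the camera says.
    # Reinforcement learning would instead learn this mapping from sampled rewards.
    return observation

env = GraspEnv()
obs = env.observe()
for t in range(5):
    action = policy(obs)
    reward, obs = env.step(action)
    print(f"t={t} action={action:+.3f} reward={reward:+.3f}")
</syntaxhighlight>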
=== Learning ===
=== State vs. Observation ===
A state is a complete representation of the physical world, while an observation is some subset or representation of the state s. They are not necessarily the same in that we can't always infer s_t from o_t, but o_t is inferable from s_t. Thinking of it as a network of conditional probabilities, we have
* <math> s_1 \rightarrow o_1 \xrightarrow{\pi_\theta} a_1 </math> (policy)
* <math> s_1, a_1 \xrightarrow{p(s_{t+1} \mid s_t, a_t)} s_2 </math> (dynamics)
 
 
Note that <math>\theta</math> represents the parameters of the policy (for example, the weights of a neural network). Assumption: Markov property - future states are independent of past states given the present state. This is the fundamental difference between states and observations: the state satisfies the Markov property, whereas an observation alone generally does not.
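As an illustration of the factorization above, here is a small sketch that samples a trajectory by alternating the three conditionals: state to observation, observation to action via <math>\pi_\theta</math>, and (state, action) to next state via the dynamics. The Gaussian noise and the linear forms are made-up assumptions, chosen only to show the structure.

<syntaxhighlight lang="python">
import random

def emit_observation(s):
    # o_t depends only on s_t: here, a noisy reading of the state.
    return s + random.gauss(0.0, 0.1)

def policy(o, theta=0.5):
    # pi_theta(a | o): theta stands in for the policy parameters.
    return theta * o

def dynamics(s, a):
    # p(s_{t+1} | s_t, a_t): depends only on the current state and action
    # (the Markov property), never on earlier states.
    return s + a + random.gauss(0.0, 0.05)

s = 1.0  # s_1
for t in range(1, 4):
    o = emit_observation(s)   # s_t -> o_t
    a = policy(o)             # o_t -> a_t via pi_theta
    s = dynamics(s, a)        # (s_t, a_t) -> s_{t+1}
    print(f"t={t} o={o:+.3f} a={a:+.3f} next_s={s:+.3f}")
</syntaxhighlight>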
Idea 1: Policy iteration - if we have a policy <math> \pi </math> and we know <math> Q^\pi(s, a) </math>, we can improve the policy by deterministically setting the action at each state to be the argmax of <math> Q^\pi(s, a) </math> over all possible actions at that state.
<math> Q_{i+1}(s,a) = (1 - \alpha) Q_i(s,a) + \alpha \left( r(s, a) + \gamma V_i(s') \right) </math>
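A tabular sketch of Idea 1, assuming a tiny discrete MDP made up for illustration: the update above is applied with <math> V_i(s) = \max_a Q_i(s, a) </math>, followed by the greedy (argmax) improvement step.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical 2-state, 2-action MDP, invented only to exercise the update rule.
n_states, n_actions = 2, 2
alpha, gamma = 0.5, 0.9
Q = np.zeros((n_states, n_actions))

def sample_transition(s, a):
    # Made-up environment: action 1 gives reward 1 and flips the state.
    return float(a == 1), (s + a) % n_states

for i in range(100):
    V = Q.max(axis=1)  # V_i(s) = max_a Q_i(s, a)
    for s in range(n_states):
        for a in range(n_actions):
            r, s_next = sample_transition(s, a)
            # Q_{i+1}(s,a) = (1 - alpha) Q_i(s,a) + alpha (r(s,a) + gamma V_i(s'))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V[s_next])

# Policy improvement: act greedily (argmax) with respect to the learned Q.
greedy_policy = Q.argmax(axis=1)
print("Q:\n", Q)
print("greedy policy:", greedy_policy)
</syntaxhighlight>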
Idea 2: Gradient update - if <math> Q^\pi(s, a) > V^\pi(s) </math>, then a is better than average. We then modify the policy to increase the probability of a.
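A sketch of Idea 2 for a softmax policy over three discrete actions in a single state. The Q values and step size are invented; the point is only that actions with <math> Q^\pi(s, a) > V^\pi(s) </math> (positive advantage) have their probability pushed up, and the rest pushed down.

<syntaxhighlight lang="python">
import numpy as np

theta = np.zeros(3)               # policy parameters (logits), pi_theta = softmax(theta)
Q_sa = np.array([1.0, 2.0, 0.5])  # hypothetical Q^pi(s, a) estimates
learning_rate = 0.1

for step in range(50):
    probs = np.exp(theta) / np.exp(theta).sum()  # pi_theta(a | s)
    V_s = probs @ Q_sa                           # V^pi(s) = E_a[Q^pi(s, a)]
    advantage = Q_sa - V_s                       # > 0 means better than average
    # Expected policy-gradient direction for a softmax policy: pi(a) * advantage(a).
    theta += learning_rate * probs * advantage

print("final action probabilities:", np.round(np.exp(theta) / np.exp(theta).sum(), 3))
</syntaxhighlight>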