Model Free RL¶
In most real-world use cases we do not know the dynamics of the environment, so we can't use a model-based reinforcement learning approach. Model-free RL gets around this by learning from episodes of experience.
Monte Carlo Learning¶
Learn from episodes of experience, either directly from the environment or from a simulator.
Update the value of a state by averaging the total future reward (the return) observed from that state. By the law of large numbers this should converge to the true value.
You can focus Monte Carlo learning on a small subset of states rather than estimating values for all possible states.
An advantage of Monte Carlo methods is that they may be less harmed by violations of the Markov property, since they don't bootstrap.
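As a rough sketch, first-visit Monte Carlo prediction could look like the code below; the episode format (a list of (state, reward) pairs) and the function name are assumptions for illustration, not a fixed API.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """First-visit Monte Carlo value estimation.

    episodes: assumed to be a list of episodes, each a list of
    (state, reward) pairs collected by following some policy.
    """
    returns_sum = defaultdict(float)   # sum of returns observed from each state
    returns_count = defaultdict(int)   # number of first visits to each state
    V = defaultdict(float)

    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted future reward.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            if state not in states[:t]:          # first visit to this state
                returns_sum[state] += G
                returns_count[state] += 1
                # Value estimate = average of all returns observed from this state.
                V[state] = returns_sum[state] / returns_count[state]
    return V
```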
Monte Carlo Control¶
You can start with an equal likelihood of all actions at each state (a uniform random policy).
After running through episodes you can update your policy based on your estimates of future reward for all states.
Maintaining sufficient exploration is an issue with Monte Carlo learning/control: if you follow a purely greedy policy you are not guaranteed to explore all states (a common fix, ε-greedy action selection, appears in the sketch below).
\( V\left( s_{t}\right) \leftarrow V\left( s_{t}\right) +\dfrac{1}{N\left( s_{t}\right) }\left( G_{t}-V\left( s_{t}\right) \right) \)
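A minimal sketch of how these pieces might fit together for Monte Carlo control, assuming a helper run_episode(policy) that returns (state, action, reward) triples and a fixed action set (both illustrative); action values Q are averaged incrementally as in the update above, and an ε-greedy policy keeps some exploration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def mc_control(run_episode, actions, num_episodes=10_000, gamma=1.0, epsilon=0.1):
    """Every-visit Monte Carlo control with an epsilon-greedy policy.

    run_episode(policy) is assumed to return a list of
    (state, action, reward) triples generated by following `policy`.
    """
    Q = defaultdict(float)   # estimated return for each (state, action)
    N = defaultdict(int)     # visit counts used for the incremental mean

    for _ in range(num_episodes):
        policy = lambda s: epsilon_greedy(Q, s, actions, epsilon)
        episode = run_episode(policy)

        G = 0.0
        # Walk backwards so G is the discounted return from each step onwards.
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[(state, action)] += 1
            # Incremental mean, matching the update formula above.
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```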
On and Off Policy Control¶
Normally you are learning from episodic experience generated by your own policy, and then iterating on that policy (on-policy learning).
In many cases you may have data that was collected using some policy that wasn't your own.
Off-policy control uses data generated by another (behavior) policy to update your own (target) policy.
Importance sampling: the samples are weighted by the ratio of the probabilities with which the two policies would choose that action.
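A sketch of ordinary importance sampling for off-policy Monte Carlo evaluation; target_prob and behavior_prob are assumed helpers giving each policy's probability of choosing an action in a state, and the episode format is illustrative.

```python
def off_policy_mc_value(episodes, target_prob, behavior_prob, gamma=1.0):
    """Ordinary importance-sampling estimate of the start-state value.

    episodes: list of episodes, each a list of (state, action, reward) triples
    collected by following the behavior policy.
    """
    estimates = []
    for episode in episodes:
        rho, G, discount = 1.0, 0.0, 1.0
        for state, action, reward in episode:
            # Weight by how much more (or less) likely the target policy is
            # to choose this action than the behavior policy that produced it.
            rho *= target_prob(state, action) / behavior_prob(state, action)
            G += discount * reward
            discount *= gamma
        estimates.append(rho * G)
    return sum(estimates) / len(estimates)
```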
TD Learning¶
For TD learning we don't wait until the end of the episode to perform a full backup of the return. We update the value of a state using only the immediate reward and the estimated value of the next state (bootstrapping).
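In symbols, the tabular TD(0) update is \( V\left( s_{t}\right) \leftarrow V\left( s_{t}\right) + \alpha \left( r_{t+1} + \gamma V\left( s_{t+1}\right) - V\left( s_{t}\right) \right) \). A minimal sketch in code (the dict-based value table, step size alpha, and function name are illustrative):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, done=False):
    """One TD(0) step: move V(s) toward r + gamma * V(s'),
    instead of waiting for the full return as Monte Carlo does."""
    target = reward if done else reward + gamma * V.get(next_state, 0.0)
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return V
```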