Model Based RL

For model-based RL we need full knowledge of all states and of the dynamics of the environment: the state transition matrix. Dynamic programming methods operate by sweeping over the states and performing a full backup on each one. That means a state is updated based on all possible future states, the rewards, and their probabilities of occurring.
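
As a concrete picture of a single full backup, here is a minimal NumPy sketch. The tabular arrays P, R, pi, V and the discount factor are illustrative assumptions, not taken from these notes.

```python
# One full backup for a single state s under a policy pi, on a made-up tabular MDP.
# P, R, pi, V and gamma are illustrative stand-ins, not from the notes.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] transition probabilities
R = rng.random((n_states, n_actions))                             # R[s, a] expected immediate reward
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # equiprobable policy pi(a|s)
V = np.zeros(n_states)                                            # current value estimates

s = 0
# Weight every action by pi(a|s) and every successor state by P[s, a, s']:
V[s] = sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ V) for a in range(n_actions))
```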

Bootstrapping: using estimates of the value function to update other estimates of value, repeating until the system stabilizes at certain values.

Full backups are closely related to the Bellman equations. When convergence has occurred, the Bellman optimality equation is satisfied.

Policy Evaluation

  • Iterative evaluation of a policy by updating the value function of each state with the expectation of the value function of the next states; see this notebook for an example implementation, and the sketch below.
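
A minimal sketch of iterative policy evaluation, reusing the illustrative P, R, pi and gamma from the earlier sketch (a toy under those assumptions, not the notebook's code):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Sweep all states, replacing V(s) with E_pi[R_{t+1} + gamma * V(S_{t+1})] until stable."""
    V = np.zeros(P.shape[0])
    while True:
        # Full backup for every state at once.
        V_new = np.einsum("sa,sa->s", pi, R + gamma * np.einsum("sap,p->sa", P, V))
        if np.max(np.abs(V_new - V)) < tol:   # values have stabilized
            return V_new
        V = V_new

V_pi = policy_evaluation(P, R, pi, gamma)
```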

Policy Iteration

  • The policy can then be updated based on the new value function across states. One way to update it is to act greedily with respect to the value function. It can be proven that updating with the greedy policy gives a policy that is at least as good as the old policy; see this notebook for an example implementation, and the sketch below.
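
A sketch of greedy improvement and the resulting policy-iteration loop, again building on the toy P, R, gamma and policy_evaluation defined above (names and shapes are assumptions):

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Put all probability on argmax_a of the one-step lookahead E[R + gamma * V(s') | s, a]."""
    q = R + gamma * np.einsum("sap,p->sa", P, V)
    pi_new = np.zeros_like(R)
    pi_new[np.arange(R.shape[0]), q.argmax(axis=1)] = 1.0
    return pi_new

def policy_iteration(P, R, gamma):
    pi = np.full(R.shape, 1.0 / R.shape[1])      # start from the equiprobable policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)   # evaluate the current policy
        pi_new = greedy_policy(P, R, V, gamma)   # improve greedily: at least as good as pi
        if np.array_equal(pi_new, pi):           # greedy policy stopped changing
            return pi, V
        pi = pi_new
```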

Value Iteration

  • Iterate on q(s, a)

  • Iterating on q(s, a) requires storing more values than iterating on v(s); see this notebook for an example implementation, and the sketch below.
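
A sketch of value iteration carried out on q(s, a), again with the toy P, R and gamma from above; the table has one entry per state-action pair, which is the extra storage the bullet refers to:

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8):
    """Back up every (s, a) pair each sweep; max_a' q(s', a') plays the role of v(s')."""
    q = np.zeros(R.shape)                        # one entry per state-action pair
    while True:
        q_new = R + gamma * np.einsum("sap,p->sa", P, q.max(axis=1))
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new

q_star = q_value_iteration(P, R, gamma)
pi_star = q_star.argmax(axis=1)                  # the greedy policy falls out directly
```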

Bellman Expectation Equations

The Bellman expectation equation for the value function, shown below, says that the value of a state is the expectation of the total discounted future reward G given that state.

\[ \begin{aligned}v_{\pi }\left( s\right) =E_{\pi }\left[ G_{t}|S_{t}=s\right]\end{aligned} \]
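
Here G is the return, the total discounted future reward from time t, with discount factor gamma:

\[ \begin{aligned}G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma ^{2}R_{t+3}+\ldots =\sum ^{\infty }_{k=0}\gamma ^{k}R_{t+k+1}\end{aligned} \]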

Similarly, the Bellman expectation equation for the action-value function, shown below, says that the value of a state-action pair is the expectation of the total discounted future reward G given that state and action.

\[ \begin{aligned}q_{\pi }\left( s,a\right) =E_{\pi }\left[ G_{t}|S_{t}=s,A_{t}=a\right] \end{aligned} \]

Written recursively, the Bellman expectation equation for the value function says that the value of a state is the expected immediate reward plus the discounted value of the next state.

\[ \begin{aligned}v_{\pi }\left( s\right) =E_{\pi }\left[ R_{t+1}+\gamma v_{\pi }\left( S_{t+1}\right) |S_{t}=s\right] \end{aligned} \]

Similarly, the recursive Bellman expectation equation for the action-value function, shown below, says that the value of a state-action pair is the expected immediate reward plus the discounted value of the next state-action pair.

\[ \begin{aligned}q_{\pi }\left( s,a\right) =E_{\pi }\left[ R_{t+1}+\gamma q_{\pi }\left( S_{t+1},A_{t+1}\right) |S_{t}=s,A_{t}=a\right] \end{aligned} \]
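
Writing the expectations out as explicit sums over the policy and the state transition matrix shows how the two functions plug into each other; this is exactly what a full backup computes. The symbols P (probability of moving from s to s' under action a) and R (expected immediate reward for taking a in s) are introduced here for clarity:

\[ \begin{aligned}v_{\pi }\left( s\right) &=\sum _{a}\pi \left( a|s\right) q_{\pi }\left( s,a\right) \\ q_{\pi }\left( s,a\right) &=R^{a}_{s}+\gamma \sum _{s'}P^{a}_{ss'}v_{\pi }\left( s'\right) \end{aligned} \]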