Bandit Problems and MDP Intro¶

Markov¶

A markov process is a stochastic process that is memoryless. That means that the future only depends on the current state, not on the history.

Markov Reward Processes¶

A markov reward process has States: Finite set of states Rewards: Rs is the expecation of awards for a given state State Transition Prob Matrix: Probability of moving to the other states given a current state Discount Factor: How much to weigh current rewards vs future ones. In range [0,1].

Markov Decision Processes¶

A markov decision process is an extension of the MRP where the agent now has a set of actions. Actions: Finite set of actions States: Finite set of states Rewards: Rs is the expecation of awards for a given state State Transition Prob Matrix: Probability of moving to the other states given a current state and current action Discount Factor: How much to weigh current rewards vs future ones. In range [0,1].

Policy¶

A policy’s value function assigns a value to each state or to each state, action pair. There is one optimal value function but many possible optimal policies

Overview

Model Based RL (Dynamic Programming)

Columbia MS EE Notes

Bandit Problems and MDP Intro¶

Markov¶

Markov Reward Processes¶

Markov Decision Processes¶

Policy¶