Bandit Problems and MDP Intro


A Markov process is a stochastic process that is memoryless: the future depends only on the current state, not on the history of states that led to it.
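As a minimal sketch of the memoryless property, the snippet below samples a path from a small chain. The two weather states and their probabilities are made up for illustration, and the helper names `step` and `simulate` are hypothetical. Note that `step` only ever looks at the current state.

```python
import random

# Hypothetical two-state chain; states and probabilities are illustrative only.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def step(state, rng):
    """Sample the next state using only the current state (Markov property)."""
    next_states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


def simulate(start, n_steps, seed=0):
    """Return a path of n_steps transitions starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        # The history in `path` is never consulted, only its last element.
        path.append(step(path[-1], rng))
    return path


if __name__ == "__main__":
    print(simulate("sunny", 10))
```

Fixing the seed makes the sampled path reproducible, which is convenient when comparing runs.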

Markov Reward Processes

A Markov reward process (MRP) is a Markov process with rewards attached. It consists of:

States: a finite set of states.
Rewards: R_s, the expected reward for a given state.
State transition probability matrix: the probability of moving to each other state given the current state.
Discount factor: how much to weigh immediate rewards against future ones, in the range [0, 1].
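To make the four components concrete, here is a rough sketch that computes the value of each state in an MRP by repeatedly applying v(s) = R_s + gamma * sum over s' of P[s][s'] * v(s'). The function name `mrp_values` and the example numbers are assumptions for illustration, and the iteration is only guaranteed to converge when the discount factor is strictly below 1.

```python
def mrp_values(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Iteratively estimate state values of a Markov reward process.

    P[s][t] is the probability of moving from state s to state t,
    R[s] is the expected reward in state s, gamma is the discount factor.
    Assumes gamma < 1 so that the repeated update converges.
    """
    n = len(R)
    v = [0.0] * n
    for _ in range(max_iters):
        new_v = [
            R[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
            for s in range(n)
        ]
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


if __name__ == "__main__":
    # Hypothetical two-state example: state 1 pays reward 1, state 0 pays nothing.
    P = [[0.5, 0.5],
         [0.0, 1.0]]
    R = [0.0, 1.0]
    print(mrp_values(P, R, gamma=0.9))
```

A smaller discount factor shrinks every value, since rewards far in the future count for less.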

Markov Decision Processes

A Markov decision process (MDP) is an extension of the MRP in which the agent now chooses from a set of actions. It consists of:

Actions: a finite set of actions.
States: a finite set of states.
Rewards: R_s, the expected reward for a given state (in general it may also depend on the action taken).
State transition probability matrix: the probability of moving to each other state given the current state and the current action.
Discount factor: how much to weigh immediate rewards against future ones, in the range [0, 1].
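One common way to solve a small MDP is value iteration, sketched below under some assumptions: transitions are stored as `P[a][s][t]`, rewards as `R[s][a]` (allowing the reward to depend on the action), and the discount factor is below 1. The function name and the tiny two-state example are hypothetical and not taken from these notes.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Estimate optimal state values and a greedy policy for a finite MDP.

    P[a][s][t]: probability of moving from s to t when taking action a.
    R[s][a]:    expected reward for taking action a in state s (assumed layout).
    """
    n_states = len(R)
    n_actions = len(R[0])
    v = [0.0] * n_states

    def q(s, a, values):
        # Expected return of taking action a in state s, then following `values`.
        return R[s][a] + gamma * sum(
            P[a][s][t] * values[t] for t in range(n_states)
        )

    for _ in range(max_iters):
        new_v = [max(q(s, a, v) for a in range(n_actions)) for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break

    # Greedy policy: pick the best action in each state under the final values.
    policy = [
        max(range(n_actions), key=lambda a: q(s, a, v)) for s in range(n_states)
    ]
    return v, policy


if __name__ == "__main__":
    # Hypothetical MDP: action 0 stays put, action 1 switches state.
    P = [
        [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
        [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
    ]
    R = [[0.0, 0.0], [1.0, 1.0]]   # only state 1 pays a reward
    values, policy = value_iteration(P, R, gamma=0.5)
    print(values, policy)  # values approach [1.0, 2.0]; policy is [1, 0]
```

In this example the agent should switch out of state 0 and then stay in state 1, which is what the greedy policy recovers.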


A policy’s value function assigns a value to each state (the state-value function) or to each state-action pair (the action-value function). There is exactly one optimal value function, but there may be many optimal policies that achieve it.
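The sketch below illustrates how two different policies can share the same optimal value function. It evaluates two deterministic policies on a made-up three-state MDP in which both actions in state 0 are equally good. The layout `P[a][s][t]` and `R[s][a]`, the function name `evaluate_policy`, and all numbers are assumptions for illustration.

```python
def evaluate_policy(P, R, gamma, policy, tol=1e-10, max_iters=10_000):
    """Estimate the state values of a fixed deterministic policy.

    policy[s] is the action taken in state s.
    P[a][s][t] and R[s][a] follow the same assumed layout as above.
    """
    n = len(R)
    v = [0.0] * n
    for _ in range(max_iters):
        new_v = [
            R[s][policy[s]]
            + gamma * sum(P[policy[s]][s][t] * v[t] for t in range(n))
            for s in range(n)
        ]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    return v


# Hypothetical MDP. State 2 is absorbing with zero reward.
# In state 0, action 0 pays 1 and ends; action 1 pays 0 and moves to state 1.
# In state 1, either action pays 2 and ends.
P = [
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1
]
R = [[1.0, 0.0], [2.0, 2.0], [0.0, 0.0]]
GAMMA = 0.5

if __name__ == "__main__":
    # With gamma = 0.5, taking 1 now equals taking 2 one step later,
    # so both policies reach the same values in every state.
    print(evaluate_policy(P, R, GAMMA, [0, 0, 0]))
    print(evaluate_policy(P, R, GAMMA, [1, 0, 0]))
```

Both policies differ in state 0 yet produce the same values, so both are optimal for this example.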