Overview

Overview of Reinforcement Learning

Reinforcement learning (RL) is a type of machine learning that aims to resemble how humans learn. RL agents learn by interacting with their environment and modifying their actions based on feedback from their experience.

What makes RL different from other machine learning paradigms

  • No supervisor, only reward signal

  • Feedback is delayed

  • Time matters, sequential decision making

Applications: RL is applicable to any problem where the environment is dynamic and rigid, rules-based decision making won’t suffice.
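
To make the interaction loop concrete, here is a minimal sketch (in Python) of an agent acting in an environment and learning from the reward signal. The `env` object with `reset()`/`step()` methods and the `RandomAgent` placeholder are assumptions for illustration, not a specific library API.

```python
import random

class RandomAgent:
    """Picks actions uniformly at random; a placeholder for a learning agent."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

    def learn(self, state, action, reward, next_state):
        # A real agent would update its policy or value estimates here,
        # using the (possibly delayed) reward signal as feedback.
        pass

def run_episode(env, agent):
    # Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done)
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                       # agent chooses an action
        next_state, reward, done = env.step(action)     # environment responds
        agent.learn(state, action, reward, next_state)  # feedback drives learning
        total_reward += reward
        state = next_state
    return total_reward
```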

Elements of RL

A policy is a mapping from each perceived state \(s \in S\) of the environment to a probability distribution over actions.

\[ s\in S\rightarrow \pi \left( a|s\right)\]
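
As a toy illustration, a tabular policy \(\pi(a|s)\) can be stored as a dictionary mapping each state to a probability distribution over actions; the states and actions below are made up.

```python
import random

# π(a|s) as a table: each state maps to action probabilities that sum to 1.
policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, state):
    """Draw an action a ~ π(·|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s1"))  # "left" with probability 0.8
```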

Reward is a mapping from each perceived state of the environment to a single number indicating the intrinsic desirability of that state.

Return is the cumulative sum of rewards received after a given timestep.

Finite-horizon (episodic) return: \(G_{t}=R_{t+1}+R_{t+2}+R_{t+3}+\ldots +R_{T}\)

Discounted return, with discount factor \(0\leq \gamma \leq 1\): \(G_{t}=\sum ^{\infty }_{k=0}\gamma ^{k}R_{t+k+1}\)
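
A quick sketch of how both returns can be computed from an example reward sequence (the rewards and discount factor below are arbitrary):

```python
# Rewards R_{t+1}, ..., R_T observed after timestep t (made-up values).
rewards = [1.0, 0.0, -1.0, 2.0]
gamma = 0.9

episodic_return = sum(rewards)                                   # G_t, no discounting
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))

print(episodic_return)    # 2.0
print(discounted_return)  # 1.0 + 0.9*0.0 + 0.81*(-1.0) + 0.729*2.0 = 1.648
```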

The value function is an estimate of how good it is to be in a specific state when following policy \(\pi\). For a Markov Decision Process (MDP), the value of a state is defined formally as:

\[v_{\pi }\left( s\right) =E_{\pi }\left[ G_{t}\,|\,S_{t}=s\right] =E_{\pi }\left[ \sum ^{\infty }_{k=0}\gamma ^{k}R_{t+k+1}\,\Big|\,S_{t}=s\right]\]
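
Since this expectation usually cannot be computed directly, one way to approximate \(v_{\pi}(s)\) is to average sampled discounted returns. The sketch below assumes a hypothetical `sample_episode_from(s)` helper that follows \(\pi\) from state \(s\) and returns the list of observed rewards.

```python
def monte_carlo_value(s, sample_episode_from, gamma=0.9, n_episodes=1000):
    """Approximate v_pi(s) = E_pi[G_t | S_t = s] by averaging sampled returns."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode_from(s)  # R_{t+1}, R_{t+2}, ... under policy pi
        g = sum(gamma**k * r for k, r in enumerate(rewards))
        total += g
    return total / n_episodes
```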

For an MDP, the value of a state–action pair (the action-value function) is:

\[q_{\pi }\left( s,a\right) =E_{\pi }\left[ G_{t}\,|\,S_{t}=s,A_{t}=a\right] =E_{\pi }\left[ \sum ^{\infty }_{k=0}\gamma ^{k}R_{t+k+1}\,\Big|\,S_{t}=s,A_{t}=a\right]\]

The model mimics the environment: it maps the current state and chosen action to the resulting next state.

\[P_{ss'}^{a}=P\left[ S_{t+1}=s'\,|\,S_{t}=s,A_{t}=a\right]\]
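
As a sketch, a tabular model can be stored as nested dictionaries giving \(P_{ss'}^{a}\) for each state–action pair; the states, actions, and probabilities below are illustrative.

```python
# P[s][a] is a probability distribution over next states s' (made-up values).
P = {
    "s1": {"left":  {"s1": 0.3, "s2": 0.7},
           "right": {"s2": 1.0}},
    "s2": {"left":  {"s1": 1.0},
           "right": {"s1": 0.5, "s2": 0.5}},
}

def transition_prob(s, a, s_next):
    """P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]."""
    return P[s][a].get(s_next, 0.0)

print(transition_prob("s1", "left", "s2"))  # 0.7
```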