Reinforcement Learning
Agent-Environment Interface
Backup diagram
file:figures/vpi-backup.pdf
Some terms
- Each MDP comprises the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, γ \rangle$
  - $\mathcal{S}$ is the set of states
  - $\mathcal{A}$ is the set of actions
  - $\mathcal{R}$ is the set of rewards
  - $\mathcal{P}$ is the transition model
  - $γ$ is a discount factor
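For concreteness, here is a minimal sketch of the tuple written out as plain data; the state names, actions, rewards, and probabilities below are invented for a toy two-state problem.

```python
# A toy MDP written out as plain data structures (illustrative only).
# States and actions are just labels; the transition model maps
# (state, action) to a list of (next_state, reward, probability) triples.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# p(s', r | s, a): for each (s, a), the possible outcomes and their probabilities.
P = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
}

# The outcome probabilities for each (s, a) must sum to 1.
assert all(abs(sum(p for _, _, p in outs) - 1.0) < 1e-9 for outs in P.values())
```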
What is a state?
- You might represent a state with a single number or a vector of numbers
- There are a finite number of states, so they can be enumerated
- The state is a compact summary of the whole history
- If you could do better by knowing the full history, then the state does not have the Markov property
- We will later come to infinite or continuous states
What is an action?
- There are finite actions (infinite or continuous actions later)
- At each time step some action must be taken, but it can be a no-op
- The effect of the action is determined by the transition model
What is a transition model?
- A transition model is a (stochastic) mapping between states, actions, subsequent states, and rewards, $$ p(s', r | s, a) $$
- It represents how the environment “works”
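As a sketch of the idea (the states, rewards, and probabilities below are invented), sampling from $p(s', r | s, a)$ is exactly how a simulated environment would generate the next state and reward:

```python
import random

# Transition model for a toy problem:
# (state, action) -> [(next_state, reward, probability), ...]
P = {
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s0", "stay"): [("s0", 0.0, 1.0)],
}

def step(P, state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = P[(state, action)]
    weights = [p for _, _, p in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

print(step(P, "s0", "move"))  # e.g. ('s1', 1.0) with probability 0.8
```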
Goals and Rewards
The reward hypothesis
That all of what we mean by goals and purposes can be well thought of as a maximisation of the expected value of the cumulative sum of a received scalar signal (called reward). —Michael Littman (S&B)
Rewards and Episodes
Long term reward and Episodes
We aim to maximise the expected return, the cumulative reward $$ G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T $$
An episode is everything up to a final time step $T$
Note that if the task continues without a final time step ($T = \infty$), this sum may not be finite
Discounted reward
It is often natural to think of a gain in some distant future as being not so valuable as a gain right now.
$$
G_t \doteq R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \cdots
$$
for the discount factor $0 ≤ γ ≤ 1$.
Note that it is now not necessary to place a finite limit on the episode length: if $γ < 1$ and the rewards are bounded, the infinite sum converges.
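A small numerical sketch (the reward sequence and discount factor are invented) of computing $G_t$ from the formula above:

```python
# Discounted return G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...
# for a made-up finite reward sequence and discount factor.
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):   # rewards[k] is R_{t+k+1}
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 2.0, 5.0]                 # hypothetical rewards after time t
print(discounted_return(rewards, gamma=0.9))   # 1 + 0 + 0.81*2 + 0.729*5 = 6.265
```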
Unified notation for episodic and continuing tasks
- We can deal with the difference between episodic and continuing tasks using the concept of absorbing states
- Absorbing states yield zero reward and always transition to the same state
- With this convention we can write $$ G_t \doteq ∑_{k=t+1}^{T} γ^{k-t-1} R_k $$ where either $T = \infty$ or $γ = 1$ is allowed (but not both)
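A tiny sketch (with made-up rewards) of why the absorbing-state trick works: padding an episode with the zero rewards of an absorbing state leaves the return unchanged, so the episodic and continuing sums agree.

```python
# An absorbing state yields reward 0 and transitions to itself, so summing
# rewards "forever" gives the same return as stopping at the final step T.
episode_rewards = [1.0, 2.0, 3.0]            # rewards up to the terminal step
padded = episode_rewards + [0.0] * 100       # absorbing state: zeros ever after

assert sum(episode_rewards) == sum(padded)   # the return is unchanged
```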
Policies and value functions
Policies
- A policy represents how to act in each possible state
- A policy is a distribution over actions given the state,
$$
π(a | s) ∈ [0, 1]
$$
- For all $s ∈ \mathcal{S}$, $$ ∑_a π(a | s) = 1 $$
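A minimal sketch of a stochastic policy stored as a table; the states, actions, and probabilities are invented.

```python
import random

# π(a | s) as a table: each state maps to a probability distribution over actions.
policy = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 1.0, "move": 0.0},
}

# Each row is a proper distribution: the probabilities sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in policy.values())

def sample_action(policy, state):
    """Draw an action a ~ π(· | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))
```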
Value functions
- A state value function $v_π(s)$ is the long-term value of being in state $s$, assuming that you follow policy $π$,
$$ v_π(s) \doteq \mathbb{E}_π [G_t | S_t = s] $$
- A state-action value function $q_π(s,a)$ is the long-term value of being in state $s$, taking action $a$, and then following $π$ from then on,
$$ q_π(s,a) \doteq \mathbb{E}_π [G_t | S_t = s, A_t = a] $$
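One way to read the definition operationally is as an average of sampled returns (a Monte Carlo estimate). The sketch below uses a toy environment and a uniform-random policy, both invented, to estimate $v_π(s)$ this way.

```python
import random

# Estimate v_π(s) ≈ average of sampled returns G_t from episodes starting in s.
gamma = 0.9

def run_episode(start):
    """Return the discounted return of one episode from `start`."""
    s, g, discount = start, 0.0, 1.0
    while s != "terminal":
        a = random.choice(["stay", "move"])   # π: uniform random over two actions
        if a == "move":
            s, r = "terminal", 1.0            # moving ends the episode with reward 1
        else:
            s, r = start, 0.0                 # staying earns nothing
        g += discount * r
        discount *= gamma
    return g

returns = [run_episode("s0") for _ in range(10_000)]
print(sum(returns) / len(returns))   # Monte Carlo estimate of v_π(s0), ≈ 0.91 here
```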
How do we select an action?
What is the consequence of taking that action?
Optimal policies and optimal value functions
Optimal policies and value functions
- Given some MDP, what is the best value we can achieve?
$$
v_*(s) = \max_π v_π(s),
$$
for all $s ∈ \mathcal{S}$
- What is the best state-action value achievable?
$$
q_*(s, a) = \max_π q_π(s, a),
$$
for all $s ∈ \mathcal{S}, a ∈ \mathcal{A}(s)$
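Since $v_*$ is defined as a maximum over policies, a brute-force sketch can make that concrete: enumerate every deterministic policy of a tiny invented MDP, evaluate each one, and keep the best value per state. The dynamics and rewards below are made up, and this is purely illustrative, not an efficient method.

```python
from itertools import product

# Brute-force illustration of v_*(s) = max_π v_π(s) on a tiny made-up MDP.
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9

# Deterministic dynamics: (state, action) -> (next_state, reward).
model = {
    ("s0", "stay"): ("s0", 0.0), ("s0", "move"): ("s1", 1.0),
    ("s1", "stay"): ("s1", 2.0), ("s1", "move"): ("s0", 0.0),
}

def evaluate(policy):
    """Iterative policy evaluation: repeatedly back up values under the policy."""
    v = {s: 0.0 for s in states}
    for _ in range(1000):
        v = {s: model[(s, policy[s])][1] + gamma * v[model[(s, policy[s])][0]]
             for s in states}
    return v

best = {s: float("-inf") for s in states}
for choice in product(actions, repeat=len(states)):   # every deterministic policy
    v = evaluate(dict(zip(states, choice)))
    best = {s: max(best[s], v[s]) for s in states}

print(best)   # an estimate of v_*(s) for each state
```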
Exercise
- Develop a recursive expression for $v_*(s)$ and $q_*(s,a)$ from what we know so far
- Feel free to look at the book for help