Reinforcement Learning
Agent-Environment Interface
Backup diagram
file:figures/vpi-backup.pdf
Some terms
- Each MDP comprises the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, γ \rangle$
  - $\mathcal{S}$ is the set of states
  - $\mathcal{A}$ is the set of actions
  - $\mathcal{R}$ is the set of rewards
  - $\mathcal{P}$ is the transition model
  - $γ$ is a discount factor
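For concreteness, here is a minimal sketch of the tuple written out as plain data; the state names, actions, rewards, and probabilities below are invented for a toy two-state problem.

```python
# A toy MDP written out as plain data structures (illustrative only).
# States and actions are just labels; the transition model maps
# (state, action) to a list of (next_state, reward, probability) triples.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# p(s', r | s, a): for each (s, a), the possible outcomes and their probabilities.
P = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
}

# The outcome probabilities for each (s, a) must sum to 1.
assert all(abs(sum(p for _, _, p in outs) - 1.0) < 1e-9 for outs in P.values())
```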
What is a state?
- You might represent a state with a single number or a vector of numbers
- There are a finite number of states, so they can be enumerated
- The state is a compact summary of the whole history
- If you could do better by knowing the full history, then the state does not have the Markov property
- We will later come to infinite or continuous states
What is an action?
- There are finite actions (infinite or continuous actions later)
- At each time step some action must be taken, but it can be a no-op
- The effect of the action is determined by the transition model
What is a transition model?
- A transition model is a (stochastic) mapping between states, actions, subsequent states, and rewards, $$ p(s', r | s, a) $$
- It represents how the environment “works”
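As a sketch of the idea (the states, rewards, and probabilities below are invented), sampling from $p(s', r | s, a)$ is exactly how a simulated environment would generate the next state and reward:

```python
import random

# Transition model for a toy problem:
# (state, action) -> [(next_state, reward, probability), ...]
P = {
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s0", "stay"): [("s0", 0.0, 1.0)],
}

def step(P, state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = P[(state, action)]
    weights = [p for _, _, p in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

print(step(P, "s0", "move"))  # e.g. ('s1', 1.0) with probability 0.8
```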
Goals and Rewards
The reward hypothesis
That all of what we mean by goals and purposes can be well thought of as a maximisation of the expected value of the cumulative sum of a received scalar signal (called reward). —Michael Littman (S&B)
Rewards and Episodes
Long term reward and Episodes
We aim to maximise the expected return, the cumulative reward $$ G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T $$
An episode is everything up to a final time step $T$
Note that if the task continues without a final time step ($T = \infty$), this sum may not be finite
Discounted reward
It is often natural to think of a gain in some distant future as being not so valuable as a gain right now.
$$
G_t \doteq R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \cdots
$$
for the discount factor $0 ≤ γ ≤ 1$.
Note that it is now not necessary to place a finite limit on the episode length: if $γ < 1$ and the rewards are bounded, the infinite sum converges.
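A small numerical sketch (the reward sequence and discount factor are invented) of computing $G_t$ from the formula above:

```python
# Discounted return G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...
# for a made-up finite reward sequence and discount factor.
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):   # rewards[k] is R_{t+k+1}
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 2.0, 5.0]                 # hypothetical rewards after time t
print(discounted_return(rewards, gamma=0.9))   # 1 + 0 + 0.81*2 + 0.729*5 = 6.265
```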
Unified notation for episodic and continuing tasks
- We can deal with the difference between episodic and continuing tasks using the concept of absorbing states
- Absorbing states yield zero reward and always transition to the same state
- With this convention we can write $$ G_t \doteq ∑_{k=t+1}^{T} γ^{k-t-1} R_k $$ where either $T = \infty$ or $γ = 1$ is allowed (but not both)
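A tiny sketch (with made-up rewards) of why the absorbing-state trick works: padding an episode with the zero rewards of an absorbing state leaves the return unchanged, so the episodic and continuing sums agree.

```python
# An absorbing state yields reward 0 and transitions to itself, so summing
# rewards "forever" gives the same return as stopping at the final step T.
episode_rewards = [1.0, 2.0, 3.0]            # rewards up to the terminal step
padded = episode_rewards + [0.0] * 100       # absorbing state: zeros ever after

assert sum(episode_rewards) == sum(padded)   # the return is unchanged
```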
Policies and value functions
Policies
- A policy represents how to act in each possible state
- A policy is a distribution over actions given the state,
$$
π(a | s) ∈ [0, 1]
$$
- For all $s ∈ \mathcal{S}$, $$ ∑_a π(a | s) = 1 $$
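A minimal sketch of a stochastic policy stored as a table; the states, actions, and probabilities are invented.

```python
import random

# π(a | s) as a table: each state maps to a probability distribution over actions.
policy = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 1.0, "move": 0.0},
}

# Each row is a proper distribution: the probabilities sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in policy.values())

def sample_action(policy, state):
    """Draw an action a ~ π(· | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))
```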
Value functions
- A state value function $v_π(s)$ is the long-term value of being in state $s$, assuming that you follow policy $π$,
$$ v_π(s) \doteq \mathbb{E}_π [G_t | S_t = s] $$
- A state-action value function $q_π(s,a)$ is the long-term value of being in state $s$, taking action $a$, and then following $π$ from then on,
$$ q_π(s,a) \doteq \mathbb{E}_π [G_t | S_t = s, A_t = a] $$
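One way to read the definition operationally is as an average of sampled returns (a Monte Carlo estimate). The sketch below uses a toy environment and a uniform-random policy, both invented, to estimate $v_π(s)$ this way.

```python
import random

# Estimate v_π(s) ≈ average of sampled returns G_t from episodes starting in s.
gamma = 0.9

def run_episode(start):
    """Return the discounted return of one episode from `start`."""
    s, g, discount = start, 0.0, 1.0
    while s != "terminal":
        a = random.choice(["stay", "move"])   # π: uniform random over two actions
        if a == "move":
            s, r = "terminal", 1.0            # moving ends the episode with reward 1
        else:
            s, r = start, 0.0                 # staying earns nothing
        g += discount * r
        discount *= gamma
    return g

returns = [run_episode("s0") for _ in range(10_000)]
print(sum(returns) / len(returns))   # Monte Carlo estimate of v_π(s0), ≈ 0.91 here
```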
How do we select an action?
What is the consequence of taking that action?
Optimal policies and optimal value functions
Optimal policies and value functions
- Given some MDP, what is the best value we can achieve?
$$
v_*(s) = \max_π v_π(s),
$$
for all $s ∈ \mathcal{S}$
- What is the best state-action value achievable?
$$
q_*(s, a) = \max_π q_π(s, a),
$$
for all $s ∈ \mathcal{S}, a ∈ \mathcal{A}(s)$
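Since $v_*$ is defined as a maximum over policies, a brute-force sketch can make that concrete: enumerate every deterministic policy of a tiny invented MDP, evaluate each one, and keep the best value per state. The dynamics and rewards below are made up, and this is purely illustrative, not an efficient method.

```python
from itertools import product

# Brute-force illustration of v_*(s) = max_π v_π(s) on a tiny made-up MDP.
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9

# Deterministic dynamics: (state, action) -> (next_state, reward).
model = {
    ("s0", "stay"): ("s0", 0.0), ("s0", "move"): ("s1", 1.0),
    ("s1", "stay"): ("s1", 2.0), ("s1", "move"): ("s0", 0.0),
}

def evaluate(policy):
    """Iterative policy evaluation: repeatedly back up values under the policy."""
    v = {s: 0.0 for s in states}
    for _ in range(1000):
        v = {s: model[(s, policy[s])][1] + gamma * v[model[(s, policy[s])][0]]
             for s in states}
    return v

best = {s: float("-inf") for s in states}
for choice in product(actions, repeat=len(states)):   # every deterministic policy
    v = evaluate(dict(zip(states, choice)))
    best = {s: max(best[s], v[s]) for s in states}

print(best)   # an estimate of v_*(s) for each state
```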
Exercise
- Develop a recursive expression for $v_*(s)$ and $q_*(s,a)$ from what we know so far
- Feel free to look at the book for help