Reinforcement Learning

Agent-Environment Interface

figures/agent-env.pdf

$$ S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots $$
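A minimal sketch of the loop that generates such a trajectory, assuming a hypothetical environment object with `reset()` and `step(action)` methods and a hypothetical `choose_action` function (not any particular library's API):

```python
# Run one episode of agent-environment interaction and record the
# trajectory S_0, A_0, R_1, S_1, A_1, R_2, ... as (S_t, A_t, R_{t+1}) triples.
def run_episode(env, choose_action):
    trajectory = []
    state = env.reset()                              # S_0
    done = False
    while not done:
        action = choose_action(state)                # A_t
        next_state, reward, done = env.step(action)  # S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```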

Backup diagram

figures/vpi-backup.pdf

Some terms

  • Each MDP comprises the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, γ \rangle$
  • $\mathcal{S}$ is the set of states
  • $\mathcal{A}$ is the set of actions
  • $\mathcal{R}$ is the set of rewards
  • $\mathcal{P}$ is the transition model
  • $γ$ is a discount factor
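One lightweight way to hold these five elements together (an illustrative sketch, not a prescribed format) is a small container whose transition model maps each state-action pair to its possible outcomes:

```python
from dataclasses import dataclass

# Illustrative container for an MDP <S, A, R, P, gamma>.
# transitions maps (s, a) to a list of (probability, next_state, reward) triples.
@dataclass
class MDP:
    states: list        # S, the set of states
    actions: list       # A, the set of actions
    rewards: list       # R, the set of possible rewards
    transitions: dict   # P, the transition model: (s, a) -> [(prob, s', r), ...]
    gamma: float        # discount factor
```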

What is a state?

  • You might represent a state with a single number or a vector of numbers
  • There are finitely many states, so they can be enumerated
  • State is a compact representation of all history
  • If knowing the full history would let you do better than knowing the current state alone, then the state does not have the Markov property
  • We will later come to infinite or continuous states

What is an action?

  • There are finitely many actions (infinite or continuous actions come later)
  • At each time step some action must be taken, but it can be a no-op
  • The effect of the action is determined by the transition model

What is a transition model?

  • A transition model is a (stochastic) mapping from a state and action to the subsequent state and reward, $$ p(s', r | s, a) $$
  • It represents how the environment “works”
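For a finite MDP this model can be stored as a table; the sketch below assumes the illustrative layout from above, where `p[(s, a)]` lists `(probability, next_state, reward)` triples, and samples one outcome from it.

```python
import random

# Sample a (next_state, reward) pair from a tabular model where
# p[(s, a)] = [(probability, next_state, reward), ...] and the
# probabilities for each (s, a) sum to 1.
def sample_transition(p, state, action):
    outcomes = p[(state, action)]
    probs = [prob for prob, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

# Illustrative two-state model: "stay" in state 0 usually keeps you there.
p = {(0, "stay"): [(0.9, 0, 1.0), (0.1, 1, 0.0)]}
print(sample_transition(p, 0, "stay"))
```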

Goals and Rewards

The reward hypothesis

That all of what we mean by goals and purposes can be well thought of as a maximisation of the expected value of the cumulative sum of a received scalar signal (called reward). —Michael Littman (S&B)

Rewards and Episodes

Long term reward and Episodes

We aim to maximise the expected return $G_t$, which is the sum of all future rewards, $$ G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T $$ ending in the final reward at time $T$.

An episode is everything up to a final time step $T$.

Note that if $T$ were infinite, we would have the potential for infinite long-term reward.

Discounted reward

It is often natural to think of a gain in some distant future as being not so valuable as a gain right now. $$ G_t \doteq R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \cdots $$ for $γ<1$.

$$ G_t \doteq \sum_{0 \leq k < ∞} γ^k R_{t+k+1} $$

Note that it is now not necessary to place a finite limit on the episode length.
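A small sketch of computing this discounted return from a (finite prefix of a) reward list; the helper name is ours, not from any library.

```python
# Discounted return G_t for rewards = [R_{t+1}, R_{t+2}, ...]:
#   G_t = sum over k of gamma^k * rewards[k]
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81, approximately 2.71
```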

Unified notation for episodic and continuing tasks

  • We can deal with the difference between episodic and continuing tasks using the concept of absorbing states
  • Absorbing states yield zero reward and always transition to the same state

$$ G_t \doteq \sum_{t+1 \leq k \leq T} γ^{k-t-1} R_k $$

including the possibility that $T=∞$ or $γ = 1$ (but not both).
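As a quick check of the unification, an episode's rewards followed by an absorbing state (zero reward forever, truncated here) give exactly the same discounted return; a minimal sketch:

```python
gamma = 0.9
episode = [1.0, 2.0, 3.0]            # rewards R_{t+1}, ..., R_T of a finite episode
padded = episode + [0.0] * 100       # absorbing state: zero reward from T onwards (truncated)

g = lambda rewards: sum(gamma ** k * r for k, r in enumerate(rewards))
assert g(episode) == g(padded)       # the zero tail adds nothing to the return
```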

Policies and value functions

Policies

  • A policy represents how to act in each possible state
  • A policy is a distribution over actions given the current state, $$ π(a | s) ∈ [0, 1], $$ and for all $s∈ \mathcal{S}$, $$ ∑_a π(a | s) = 1 $$
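For a finite MDP a policy is just a probability table per state; the sketch below (illustrative data layout) samples an action from it.

```python
import random

# Illustrative tabular policy: pi[s] maps each action to its probability,
# and each row sums to 1.
pi = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, state):
    actions = list(pi[state])                  # available actions in this state
    probs = [pi[state][a] for a in actions]    # their probabilities
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s1"))   # "right" roughly 90% of the time
```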

Value functions

  • A state value function $v_π(s)$ is the long term value of being in state $s$ assuming that you follow policy $π$ $$ v_π(s) \doteq \mathbb{E}_π [G_t | S_t = s] $$
  • A state action value function $q_π(s,a)$ is the long term value of being in state $s$, taking action $a$, and then following $π$ from then on. $$ q_π(s,a) \doteq \mathbb{E}_π [G_t | S_t = s, A_t=a] $$
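One way to make these definitions concrete is a Monte Carlo estimate: generate many episodes by following $π$ and average the observed returns $G_t$. A rough sketch, reusing the hypothetical `env` (with `reset()`/`step()`) and `sample_action` helpers from the earlier sketches:

```python
# Rough Monte Carlo estimate of v_pi for the environment's start state:
# average the discounted return over many episodes generated by following pi.
def estimate_v_start(env, pi, gamma, n_episodes=1000):
    total = 0.0
    for _ in range(n_episodes):
        state, done = env.reset(), False
        g, discount = 0.0, 1.0
        while not done:
            action = sample_action(pi, state)
            state, reward, done = env.step(action)
            g += discount * reward
            discount *= gamma
        total += g
    return total / n_episodes     # approximates v_pi(S_0)
```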

How do we select an action?

figures/backup-s-a.pdf

What is the consequence of taking that action?

figures/backup-a-s.pdf
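One way to read this diagram is as an expected one-step backup: weight each possible $(s', r)$ outcome by $p(s', r | s, a)$ and add the discounted value of the successor state. A minimal sketch, assuming the tabular transition format used in the earlier sketches:

```python
# Expected value of taking action a in state s, given current state-value
# estimates v (a dict) and a tabular model where p[(s, a)] is a list of
# (probability, next_state, reward) triples:
#   q_pi(s, a) = sum of prob * (reward + gamma * v[next_state])
def backup_q(p, v, state, action, gamma):
    return sum(prob * (reward + gamma * v[next_state])
               for prob, next_state, reward in p[(state, action)])
```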

Optimal policies and optimal value functions

Optimal policies and value functions

  • Given some MDP, what is the best value we can achieve? $$ v_*(s) = \max_π v_π(s), $$ for all $s ∈ \mathcal{S}$
  • What is the best state action value achievable? $$ q_*(s, a) = \max_π q_π(s, a), $$ for all $s ∈ \mathcal{S}, a ∈ \mathcal{A}(s)$
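Given $q_*$ (or an estimate of it), acting optimally in a state amounts to picking an action that maximises $q_*(s, \cdot)$; a small sketch with an illustrative tabular layout:

```python
# q[s] maps each available action to its estimated value q(s, a);
# the greedy action is the one with the largest value.
def greedy_action(q, state):
    return max(q[state], key=q[state].get)

q = {"s0": {"left": 0.2, "right": 1.3}}
print(greedy_action(q, "s0"))   # "right"
```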

Exercise

  • Develop a recursive expression for $v_*(s)$ and $q_*(s,a)$ from what we know so far
  • Feel free to look at the book for help