Reinforcement Learning
Policy evaluation (prediction)
Why study Dynamic Programming?
- DP assumes a perfect model and is computationally expensive
- Still important theoretically, though
- Basis for understanding the approximate methods that follow
Expectation (reminder)
- Expected value is the average outcome of a random event over a large number of trials
- Where all possible values are equally likely, this is simply the mean of the possible values
- Where each $x_i$ occurs with probability $p(x_i)$, we have the sum $$ \mathbb{E}\left[ x \right] = ∑_i p(x_i)\, x_i $$
- Quick quiz: what’s the expected value of a fair, six-sided die?
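For illustration, a minimal Python sketch of this sum, with the distribution given as a list of (value, probability) pairs (the example numbers are made up):

```python
# A minimal sketch: E[x] = sum_i p(x_i) * x_i for a discrete random variable,
# given as a list of (value, probability) pairs.
def expectation(dist):
    return sum(p * x for x, p in dist)

# Example (made up): a biased coin paying 1 with probability 0.3 and 0 otherwise.
print(expectation([(1, 0.3), (0, 0.7)]))  # 0.3
```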
Expectation with a subscript?
- When the expectation carries a subscript, it is taken according to that probability measure
$$
\mathbb{E}_y[x] = ∑_i p_y(x_i)x_i
$$
where $p_y$ is a probability measure for $x$ that might differ from the sampling probability.
Bellman optimality equation
- Dynamic programming is based on solving Bellman’s optimality equation
- Why is this equation so important, and what do the variables mean?
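For reference, written in the notation used on the following slides, the optimality equation for the state-value function is $$ v_*(s) = \max_a ∑_{s',r} p(s',r|s,a)\left[ r + γ v_*(s') \right] $$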
Policy evaluation (prediction)
Evaluation asks:
- Given a policy $π$, what is the expected value $v_π$ of each state $s$? $$ v_π(s) \doteq \mathbb{E}_π[G_t | S_t=s] $$
Policy evaluation (2)
- We can substitute for $G_t$ based on $$ G_t \doteq R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \cdots $$ or, more simply, $$ G_t = R_{t+1} + γ G_{t+1} $$
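As a quick sketch of this recursion (the rewards and discount below are made-up numbers):

```python
# Computing the return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
# via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```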
Policy evaluation (3)
We should get $$ v_π(s) = ∑_a π(a|s) ∑_{s',r} p(s',r|s,a)\left[r+γ v_π(s')\right] $$ where
- $p(s',r|s,a)$ is the transition probability
- $π(a|s)$ is the probability of taking action $a$ when in state $s$
- $r$ is the immediate reward
- $γ$ is the discount factor
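As a sketch of what this backup computes for a single state, here is a small Python helper; the dictionary representation of $π$ and $p$ is an assumption made for illustration, with `pi[s][a]` standing for $π(a|s)$ and `p[(s, a)]` listing $(s', r, \text{prob})$ triples for $p(s',r|s,a)$:

```python
# A sketch of the right-hand side above for one state s.
# Assumed representation: pi[s][a] = pi(a|s); p[(s, a)] = [(s_next, r, prob), ...]
# where prob = p(s', r | s, a).
def bellman_backup(s, v, pi, p, gamma):
    total = 0.0
    for a, pi_a in pi[s].items():           # sum over actions a
        for s_next, r, prob in p[(s, a)]:   # sum over (s', r)
            total += pi_a * prob * (r + gamma * v[s_next])
    return total
```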
Policy evaluation (4)
For this simple recycling robot, write down the set of equations for $v_π$.
Policy evaluation (5)
Here is the equation for $v_π(\mathrm{high})$:
$$
v_π(\mathrm{high}) = \cdots + \frac{1}{2} \Big( α (r_\mathrm{search} + γ v_π (\mathrm{high})) + (1-α) (r_\mathrm{search}+ γ v_π (\mathrm{low}))\Big)
$$
LP solution
Note that we have equations in the form:
$$
v(a) = k_1 v(a) + k_2 v(b) + k_3
$$
which can be converted into
$$
-k_3 = (k_1 - 1) v(a) + k_2 v(b)
$$
or
$$
\begin{pmatrix}
-k_3 \\
\vdots
\end{pmatrix} = \begin{bmatrix}
(k_1 - 1) & k_2 \\
\vdots & \vdots
\end{bmatrix} \mathbf{v}
$$
This system can be solved by inverting the matrix. Do try this at home!
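As a sketch of the "try this at home" exercise, the linear system can be solved with NumPy; the two-state coefficients below are made up purely for illustration:

```python
import numpy as np

# Each row encodes  -k_3 = (k_1 - 1) v(a) + k_2 v(b)  for one state.
# The coefficients and rewards below are invented for illustration only.
A = np.array([[0.9 * 0.5 - 1.0, 0.9 * 0.5],         # row for state a
              [0.9 * 0.3,       0.9 * 0.7 - 1.0]])  # row for state b
b = np.array([-1.0, -0.5])                           # the -k_3 terms

v = np.linalg.solve(A, b)  # preferable to explicitly inverting the matrix
print(v)                   # [v(a), v(b)]
```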
Iterative policy evaluation
Iterative policy evaluation
Approximate methods
- It is also possible to find the solution iteratively, by starting off with some guess for $v$ and updating according to the equations, as sketched below.
- This approach forms the basis of many other methods, and for large MDPs we would never attempt the LP solution.
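A minimal sketch of this iterative approach, assuming the same dictionary representation of $π$ and $p$ as in the earlier sketch:

```python
# Iterative policy evaluation (sketch). Assumed representation:
# pi[s][a] = pi(a|s); p[(s, a)] = [(s_next, r, prob), ...].
def policy_evaluation(states, pi, p, gamma, theta=1e-8):
    v = {s: 0.0 for s in states}   # initial guess for v
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(pi_a * prob * (r + gamma * v[s_next])
                        for a, pi_a in pi[s].items()
                        for s_next, r, prob in p[(s, a)])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:          # stop when the largest change is small
            return v
```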
Policy improvement
Policy improvement
- Given that we can evaluate any policy, we can improve it by updating it to take the action that maximises the resulting value
- We define the expected value of taking an action $a$ and then following $π$ from then on: $$ q_π(s, a) \doteq \mathbb{E} [R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a] $$
- which expands to $$ q_π(s, a) = ∑_{s',r} p(s',r|s, a) [r + γ v_π(s')] $$
Policy improvement (2)
- If we can find some $a$ for which $q_π(s, a) > v_π(s)$, then we can improve $π$ by adjusting it to use $a$ (see the sketch below)
- If we cannot find any improvement in any state, then $π$ must be optimal
- Discuss: what aspects of MDPs are relied on to ensure this is true?
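A sketch of one greedy improvement step under the same assumed representation; `actions[s]` is assumed to list the actions available in state $s$:

```python
# q(s, a) computed from v, as on the previous slide (sketch).
def q_value(s, a, v, p, gamma):
    return sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in p[(s, a)])

# Improved (deterministic) policy: in each state take the action with the best q.
def greedy_policy(states, actions, v, p, gamma):
    return {s: max(actions[s], key=lambda a: q_value(s, a, v, p, gamma))
            for s in states}
```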
Policy iteration
Policy iteration algorithm
S1: Initialisation
- $V(s) ∈ \mathbb{R}$ and $π(s) ∈ A(s)$ arbitrarily for all $s ∈ S$
S2: Policy evaluation
- Loop:
  - $Δ ← 0$
  - Loop for each $s ∈ S$:
    - $v ← V(s)$
    - $V(s) ← ∑_{s',r} p(s',r|s, π(s)) [r + γ V(s')]$
    - $Δ ← \max(Δ, |v - V(s)|)$
- until $Δ < θ$
S3: Policy Improvement
- $\textit{policy-stable} ← 1$
- For each $s ∈ S$:
  - $\textit{old-action} ← π(s)$
  - $π(s) ← \arg\max_a ∑_{s',r} p(s', r|s, a) [r + γ V(s')]$
  - If $\textit{old-action} ≠ π(s)$, then $\textit{policy-stable} ← 0$
- If $\textit{policy-stable}$, then stop and return $V ≈ v_*$ and $π ≈ π_*$; else go to S2
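A compact Python sketch of the whole loop above, again under the assumed dictionary representation of the MDP (`p[(s, a)]` a list of $(s', r, \text{prob})$ triples, `actions[s]` the available actions):

```python
# Policy iteration (sketch), following steps S1-S3 above.
def policy_iteration(states, actions, p, gamma, theta=1e-8):
    # S1: initialise the value function and an arbitrary deterministic policy.
    V = {s: 0.0 for s in states}
    pi = {s: actions[s][0] for s in states}

    def backup(s, a):
        return sum(prob * (r + gamma * V[s_next]) for s_next, r, prob in p[(s, a)])

    while True:
        # S2: policy evaluation -- sweep until the largest update is below theta.
        while True:
            delta = 0.0
            for s in states:
                v_old = V[s]
                V[s] = backup(s, pi[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # S3: policy improvement -- act greedily with respect to V.
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions[s], key=lambda a: backup(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi   # V ≈ v*, pi ≈ pi*
```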
Exercise: Policy iteration for action values
Revise the algorithm on the previous page to use action values
Value iteration
Value iteration
- A problem with policy iteration is the need to perform policy evaluation to some accuracy before improving the policy.
- Value iteration skips the explicit policy and simply updates each value to the maximum over actions: $$ v(s) ← \max_a \mathbb{E} [R_{t+1} + γ v(S_{t+1}) | S_t=s, A_t = a ] $$
- The algorithm terminates when the size of the update to any $v(s)$ is smaller than some threshold (see the sketch below)
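A minimal sketch of value iteration, using the same assumed MDP representation as the earlier sketches:

```python
# Value iteration (sketch): v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')].
def value_iteration(states, actions, p, gamma, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = max(sum(prob * (r + gamma * V[s_next])
                           for s_next, r, prob in p[(s, a)])
                       for a in actions[s])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:   # stop when no v(s) changes by more than theta
            return V
```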