Reinforcement Learning

Policy evaluation (prediction)

Why study Dynamic Programming?

  • DP assumes a perfect model of the environment and is computationally expensive
  • Still important theoretically, though
  • Basis for understanding the methods that follow

Expectation (reminder)

  • Expected value is the average outcome of a random event given a large number of trials
  • Where all possible values are equally likely, this is simply the mean of possible values
  • Where each $x_i$ occurs with probability $p(x_i)$, we have the sum $$ \mathbb{E}\left[ x \right] = ∑_i p(x_i)\, x_i $$
  • Quick quiz: what’s the expected value of a fair, six-sided die?
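As a quick numerical check of the weighted-sum formula (and of “average outcome over a large number of trials”), here is a minimal Python sketch; the biased-die values and probabilities below are invented purely for illustration:

```python
import random

# A small, made-up discrete distribution: a biased die
# (values and probabilities are illustrative only).
values = [1, 2, 3, 4, 5, 6]
probs  = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

# Expectation as the probability-weighted sum: E[x] = sum_i p(x_i) * x_i
expectation = sum(p * x for p, x in zip(probs, values))

# The same quantity approximated as the average outcome over many trials
samples = random.choices(values, weights=probs, k=100_000)
empirical_mean = sum(samples) / len(samples)

print(f"E[x] = {expectation:.3f}, sample mean = {empirical_mean:.3f}")
```

Setting all probabilities to 1/6 gives the answer to the quiz.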

Expectation with a subscript?

  • A subscript on the expectation means it is taken with respect to that probability measure: $$ \mathbb{E}_y[x] = ∑_i p_y(x_i)\, x_i $$ where $p_y$ is a probability measure for $x$ that might differ from the sampling probability.

Bellman optimality equation

  • Dynamic programming is based on solving Bellman’s optimality equation

$$ v_*(s) = \max_a \mathbb{E}\left[ R_{t+1} + γ v_*(S_{t+1}) | S_t=s, A_t=a \right] $$

  • Why is this equation so important, and what do the variables mean?

Policy evaluation (prediction)

Evaluation asks:

  • Given a policy $π$, what is the expected value $v_π$ of each state $s$? $$ v_π(s) \doteq \mathbb{E}_π[G_t | S_t=s] $$

Policy evaluation (2)

  • We can substitute for $G_t$ based on $$ G_t \doteq R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \cdots $$ or more simply $$ G_t = R_{t+1} + γ G_{t+1} $$

Policy evaluation (3)

We should get $$ v_π(s) = ∑_a π(a|s) ∑_{s',r} p(s',r|s,a)\left[r + γ v_π(s')\right] $$ where

  • $p(s',r|s,a)$ is the transition probability
  • $π(a|s)$ is the probability of taking action $a$ when in state $s$
  • $r$ is the immediate reward
  • $γ$ is the discount factor

Policy evaluation (4)

For this simple recycle robot, write down the set of equations for $v_π(s)$ assuming that $π$ is a uniform distribution over actions.

figures/recycle-robot.pdf

Policy evaluation (5)

Here is the equation for $v_π(\mathrm{high})$:
\begin{eqnarray*}
v_π(\mathrm{high}) & = & \frac{1}{2} (r_\mathrm{wait} + γ v_π(\mathrm{high})) + {} \\
& & \frac{1}{2} \Big( α (r_\mathrm{search} + γ v_π(\mathrm{high})) + {} \\
& & \qquad (1-α) (r_\mathrm{search} + γ v_π(\mathrm{low}))\Big)
\end{eqnarray*}

LP solution

Note that we have equations in the form: $$ v(a) = k_1 v(a) + k_2 v(b) + k_3 $$ which can be converted into $$ -k_3 = (k_1 - 1) v(a) + k_2 v(b) $$ or $$ \begin{pmatrix} -k_3 \\ \vdots \end{pmatrix} = \begin{bmatrix} (k_1 - 1) & k_2 \\ \vdots & \vdots \end{bmatrix} \mathbf{v} $$

This system can be solved by inverting the matrix. Do try this at home!
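If you do want to try this at home, here is a minimal sketch using NumPy; the two-state coefficients below are placeholders, not values from the recycle-robot example:

```python
import numpy as np

# Placeholder coefficients for a two-state system of the form
#   v(a) = k1*v(a) + k2*v(b) + k3
#   v(b) = m1*v(a) + m2*v(b) + m3
# (the numbers are invented for illustration).
k1, k2, k3 = 0.45, 0.45, 1.0
m1, m2, m3 = 0.30, 0.60, 0.5

# Rearranged as A v = b, where each row is (k1 - 1) v(a) + k2 v(b) = -k3
A = np.array([[k1 - 1.0, k2],
              [m1,       m2 - 1.0]])
b = np.array([-k3, -m3])

# np.linalg.solve is numerically preferable to explicitly inverting A
v = np.linalg.solve(A, b)
print(v)   # v[0] = v(a), v[1] = v(b)
```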

Iterative policy evaluation

figures/fig4.1a.pdf

Iterative policy evaluation

figures/fig4.1b.pdf

Approximate methods

  • It is also possible to find the solution iteratively by starting off with some guesses for $v$ and updating according to the equations.
  • This approach forms the basis of many other methods; for large MDPs we would never attempt the LP solution
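As a concrete illustration of the iterative approach, here is a minimal Python sketch of iterative policy evaluation. It assumes the dynamics are available as a dictionary `P[s][a] = [(prob, next_state, reward), ...]` and the policy as `pi[s][a] = probability`; both layouts (and the function name) are assumptions for this example, not part of the slides.

```python
def policy_evaluation(P, pi, gamma=0.9, theta=1e-6):
    """Approximate v_pi by sweeping Bellman expectation backups.

    P[s][a]  -> list of (prob, next_state, reward) tuples (assumed layout)
    pi[s][a] -> probability of taking action a in state s
    """
    V = {s: 0.0 for s in P}                    # initial guess v(s) = 0
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2])
                               for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                      # stop when updates are tiny
            return V
```

With the recycle-robot dynamics encoded in `P` and a uniform `pi`, this converges to the same values as the linear system above.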

Policy improvement

Policy improvement

  • Given that we can evaluate any policy, we can therefore improve it by updating the policy so that it takes the action that maximises the resulting value
  • We define the expected value of taking an action $a$ and then following $π$ from then on: $$ q_π(s, a) \doteq \mathbb{E} [R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a] $$
  • which expands to $$ = ∑_{s',r} p(s',r|s, a) [r + γ v_π(s')] $$
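Under the same assumed `P[s][a]` layout as the evaluation sketch above, this expansion of $q_π$ from a value table is a one-liner:

```python
def action_value(P, V, s, a, gamma=0.9):
    """q_pi(s, a): expected one-step return of taking a in s and then
    following the policy whose value table is V (assumed P layout)."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
```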

Policy improvement (2)

  • If we can find some $a$ for which $q_π (s, a) > v_π(s)$, then we can improve $π$ by adjusting it to use $a$
  • If no improving action can be found in any state, then $π$ must be optimal
  • Discuss:
    • what aspects of MDPs are relied on to ensure this is true?

Policy iteration

Policy iteration algorithm

S1: Initialisation

  • $V(s) ∈ \mathbb{R}$ and $π(s) ∈ A(s)$ arbitrarily for all $s ∈ S$

S2: Policy evaluation

  • Loop:
    • $Δ ← 0$
    • Loop for each $s∈ S$:
      • $v ← V(s)$
      • $V(s) ← ∑_{s',r} p(s',r|s, π(s)) [r + γ V(s')]$
      • $Δ ← \max(Δ, |v - V(s)|)$
    • until $Δ < θ$

S3: Policy Improvement

  • $\textit{policy-stable} ← 1$
  • For each $s∈ S$:
    • $\textit{old-action}← π(s)$
    • $π(s) ← \arg\max_a ∑_{s',r} p(s', r|s, a) [r + γ V(s')]$
    • If $\textit{old-action}≠ π(s)$, then $\textit{policy-stable} ← 0$
  • If $\textit{policy-stable}$, then stop and return $V≈ v_*$ and $π ≈ π_*$; else go to 2
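Putting S1–S3 together, here is a compact Python sketch of the full algorithm, again under the assumed `P[s][a] = [(prob, next_state, reward), ...]` layout, with a deterministic policy stored as a dict from state to action:

```python
def policy_iteration(P, gamma=0.9, theta=1e-6):
    """Policy iteration sketch: S1 initialise, S2 evaluate, S3 improve.
    P[s][a] -> list of (prob, next_state, reward) tuples (assumed layout)."""
    # S1: arbitrary initial values and an arbitrary deterministic policy
    V = {s: 0.0 for s in P}
    pi = {s: next(iter(P[s])) for s in P}

    def backup(s, a):
        # Expected one-step return of taking a in s, evaluated with V
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # S2: policy evaluation for the current deterministic policy
        while True:
            delta = 0.0
            for s in P:
                v = V[s]
                V[s] = backup(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # S3: policy improvement (greedy with respect to V)
        policy_stable = True
        for s in P:
            old_action = pi[s]
            pi[s] = max(P[s], key=lambda a: backup(s, a))
            if old_action != pi[s]:
                policy_stable = False

        if policy_stable:
            return V, pi   # V approximates v_*, pi approximates pi_*
```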

Exercise: Policy iteration for action values

Revise the algorithm on the previous page to use action values $q$ instead and thus find $q_*$.

Value iteration

Value iteration

  • A problem with policy iteration is the need to perform policy evaluation to some accuracy before improving the policy.
  • Value iteration skips finding an explicit policy and simply updates each value to the maximum over actions $$ v(s) ← \max_a \mathbb{E} [R_{t+1} + γ v(S_{t+1}) | S_t=s, A_t=a ] $$
  • The algorithm terminates when the size of the updates to any $v(s)$ is smaller than some threshold
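A matching Python sketch of value iteration under the same assumed `P[s][a]` layout; a greedy policy is read off once the values have converged:

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Repeatedly apply the maximising backup until the largest change in
    any v(s) falls below theta, then extract a greedy policy."""
    V = {s: 0.0 for s in P}

    def backup(s, a):
        # Expected one-step return of taking a in s under the current V
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            V[s] = max(backup(s, a) for a in P[s])   # v(s) <- max_a E[...]
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # Greedy policy with respect to the converged values
    pi = {s: max(P[s], key=lambda a, s=s: backup(s, a)) for s in P}
    return V, pi
```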

Phew!

figures/happy-homer.png