Reinforcement Learning
Policy evaluation (prediction)
Why study Dynamic Programming?
- DP assumes a perfect model and is computationally expensive
- Still important theoretically, though
- Basis for understanding the approximate methods that follow
Expectation (reminder)
- Expected value is the average outcome of a random event over a large number of trials
- Where all possible values are equally likely, this is simply the mean of the possible values
- Where each $x_i$ occurs with probability $p(x_i)$, we have the sum $$ \mathbb{E}\left[ x \right] = ∑_i p(x_i)\, x_i $$
- Quick quiz: what’s the expected value of a fair, six-sided die?
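For illustration, a minimal Python sketch of this sum, with the distribution given as a list of (value, probability) pairs (the example numbers are made up):

```python
# A minimal sketch: E[x] = sum_i p(x_i) * x_i for a discrete random variable,
# given as a list of (value, probability) pairs.
def expectation(dist):
    return sum(p * x for x, p in dist)

# Example (made up): a biased coin paying 1 with probability 0.3 and 0 otherwise.
print(expectation([(1, 0.3), (0, 0.7)]))  # 0.3
```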
Expectation with a subscript?
- When the expectation carries a subscript, it is taken according to that probability measure
$$
\mathbb{E}_y[x] = ∑_i p_y(x_i)x_i
$$
where $p_y$ is a probability measure for $x$ that might differ from the sampling probability.
Bellman optimality equation
- Dynamic programming is based on solving Bellman’s optimality equation
- Why is this equation so important, and what do the variables mean?
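For reference, written in the notation used on the following slides, the optimality equation for the state-value function is $$ v_*(s) = \max_a ∑_{s',r} p(s',r|s,a)\left[ r + γ v_*(s') \right] $$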
Policy evaluation (prediction)
Evaluation asks:
- Given a policy $π$, what is the expected value $v_π$ of each state $s$? $$ v_π(s) \doteq \mathbb{E}_π[G_t | S_t=s] $$
Policy evaluation (2)
- We can substitute for $G_t$ based on $$ G_t \doteq R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \cdots $$ or, more simply, $$ G_t = R_{t+1} + γ G_{t+1} $$
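As a quick sketch of this recursion (the rewards and discount below are made-up numbers):

```python
# Computing the return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
# via the recursion G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```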
Policy evaluation (3)
We should get $$ v_π(s) = ∑_a π(a|s) ∑_{s',r} p(s',r|s,a)\left[r+γ v_π(s')\right] $$ where
- $p(s',r|s,a)$ is the transition probability
- $π(a|s)$ is the probability of taking action $a$ when in state $s$
- $r$ is the immediate reward
- $γ$ is the discount factor
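As a sketch of what this backup computes for a single state, here is a small Python helper; the dictionary representation of $π$ and $p$ is an assumption made for illustration, with `pi[s][a]` standing for $π(a|s)$ and `p[(s, a)]` listing $(s', r, \text{prob})$ triples for $p(s',r|s,a)$:

```python
# A sketch of the right-hand side above for one state s.
# Assumed representation: pi[s][a] = pi(a|s); p[(s, a)] = [(s_next, r, prob), ...]
# where prob = p(s', r | s, a).
def bellman_backup(s, v, pi, p, gamma):
    total = 0.0
    for a, pi_a in pi[s].items():           # sum over actions a
        for s_next, r, prob in p[(s, a)]:   # sum over (s', r)
            total += pi_a * prob * (r + gamma * v[s_next])
    return total
```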
Policy evaluation (4)
For this simple recycling robot, write down the set of equations for $v_π$.
Policy evaluation (5)
Here is the equation for $v_π(\mathrm{high})$:
$$
v_π(\mathrm{high}) = \cdots + \frac{1}{2} \Big( α (r_\mathrm{search} + γ v_π (\mathrm{high})) + (1-α) (r_\mathrm{search}+ γ v_π (\mathrm{low}))\Big)
$$
LP solution
Note that we have equations in the form:
$$
v(a) = k_1 v(a) + k_2 v(b) + k_3
$$
which can be converted into
$$
-k_3 = (k_1 - 1) v(a) + k_2 v(b)
$$
or
$$
\begin{pmatrix}
-k_3 \\
\vdots
\end{pmatrix} = \begin{bmatrix}
(k_1 - 1) & k_2 \\
\vdots & \vdots
\end{bmatrix} \mathbf{v}
$$
This system can be solved by inverting the matrix. Do try this at home!
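As a sketch of the "try this at home" exercise, the linear system can be solved with NumPy; the two-state coefficients below are made up purely for illustration:

```python
import numpy as np

# Each row encodes  -k_3 = (k_1 - 1) v(a) + k_2 v(b)  for one state.
# The coefficients and rewards below are invented for illustration only.
A = np.array([[0.9 * 0.5 - 1.0, 0.9 * 0.5],         # row for state a
              [0.9 * 0.3,       0.9 * 0.7 - 1.0]])  # row for state b
b = np.array([-1.0, -0.5])                           # the -k_3 terms

v = np.linalg.solve(A, b)  # preferable to explicitly inverting the matrix
print(v)                   # [v(a), v(b)]
```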
Iterative policy evaluation
Iterative policy evaluation
Approximate methods
- It is also possible to find the solution iteratively, by starting off with some guess for $v$ and updating according to the equations, as sketched below.
- This approach forms the basis of many other methods, and for large MDPs we would never attempt the LP solution.
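A minimal sketch of this iterative approach, assuming the same dictionary representation of $π$ and $p$ as in the earlier sketch:

```python
# Iterative policy evaluation (sketch). Assumed representation:
# pi[s][a] = pi(a|s); p[(s, a)] = [(s_next, r, prob), ...].
def policy_evaluation(states, pi, p, gamma, theta=1e-8):
    v = {s: 0.0 for s in states}   # initial guess for v
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(pi_a * prob * (r + gamma * v[s_next])
                        for a, pi_a in pi[s].items()
                        for s_next, r, prob in p[(s, a)])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:          # stop when the largest change is small
            return v
```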
Policy improvement
Policy improvement
- Given that we can evaluate any policy, we can improve it by updating it to take the action that maximises the resulting value
- We define the expected value of taking an action $a$ and then following $π$ from then on: $$ q_π(s, a) \doteq \mathbb{E} [R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a] $$
- which expands to $$ q_π(s, a) = ∑_{s',r} p(s',r|s, a) [r + γ v_π(s')] $$
Policy improvement (2)
- If we can find some $a$ for which $q_π(s, a) > v_π(s)$, then we can improve $π$ by adjusting it to use $a$ (see the sketch below)
- If we cannot find any improvement in any state, then $π$ must be optimal
- Discuss: what aspects of MDPs are relied on to ensure this is true?
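A sketch of one greedy improvement step under the same assumed representation; `actions[s]` is assumed to list the actions available in state $s$:

```python
# q(s, a) computed from v, as on the previous slide (sketch).
def q_value(s, a, v, p, gamma):
    return sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in p[(s, a)])

# Improved (deterministic) policy: in each state take the action with the best q.
def greedy_policy(states, actions, v, p, gamma):
    return {s: max(actions[s], key=lambda a: q_value(s, a, v, p, gamma))
            for s in states}
```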
Policy iteration
Policy iteration algorithm
S1: Initialisation
- $V(s) ∈ \mathbb{R}$ and $π(s) ∈ A(s)$ arbitrarily for all $s ∈ S$
S2: Policy evaluation
- Loop:
  - $Δ ← 0$
  - Loop for each $s ∈ S$:
    - $v ← V(s)$
    - $V(s) ← ∑_{s',r} p(s',r|s, π(s)) [r + γ V(s')]$
    - $Δ ← \max(Δ, |v - V(s)|)$
- until $Δ < θ$
S3: Policy Improvement
- $\textit{policy-stable} ← 1$
- For each $s ∈ S$:
  - $\textit{old-action} ← π(s)$
  - $π(s) ← \arg\max_a ∑_{s',r} p(s', r|s, a) [r + γ V(s')]$
  - If $\textit{old-action} ≠ π(s)$, then $\textit{policy-stable} ← 0$
- If $\textit{policy-stable}$, then stop and return $V ≈ v_*$ and $π ≈ π_*$; else go to S2
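A compact Python sketch of the whole loop above, again under the assumed dictionary representation of the MDP (`p[(s, a)]` a list of $(s', r, \text{prob})$ triples, `actions[s]` the available actions):

```python
# Policy iteration (sketch), following steps S1-S3 above.
def policy_iteration(states, actions, p, gamma, theta=1e-8):
    # S1: initialise the value function and an arbitrary deterministic policy.
    V = {s: 0.0 for s in states}
    pi = {s: actions[s][0] for s in states}

    def backup(s, a):
        return sum(prob * (r + gamma * V[s_next]) for s_next, r, prob in p[(s, a)])

    while True:
        # S2: policy evaluation -- sweep until the largest update is below theta.
        while True:
            delta = 0.0
            for s in states:
                v_old = V[s]
                V[s] = backup(s, pi[s])
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # S3: policy improvement -- act greedily with respect to V.
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions[s], key=lambda a: backup(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi   # V ≈ v*, pi ≈ pi*
```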
Exercise: Policy iteration for action values
Revise the algorithm on the previous page to use action values
Value iteration
Value iteration
- A problem with policy iteration is the need to perform policy evaluation to some accuracy before improving the policy.
- Value iteration skips the explicit policy and simply updates each value to the maximum over actions: $$ v(s) ← \max_a \mathbb{E} [R_{t+1} + γ v(S_{t+1}) | S_t=s, A_t = a ] $$
- The algorithm terminates when the size of the update to any $v(s)$ is smaller than some threshold (see the sketch below)
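A minimal sketch of value iteration, using the same assumed MDP representation as the earlier sketches:

```python
# Value iteration (sketch): v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')].
def value_iteration(states, actions, p, gamma, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = max(sum(prob * (r + gamma * V[s_next])
                           for s_next, r, prob in p[(s, a)])
                       for a in actions[s])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:   # stop when no v(s) changes by more than theta
            return V
```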