Showing 17 changed files with 527 additions and 0 deletions.
@@ -0,0 +1,12 @@
/dp.tex
/intro.tex
/mdp.tex
/monte.pdf
/mdp.pdf
/intro.pdf
/ltximg/
/dp.pdf
/_minted-dp/
/monte.tex
*.aux
*.log
@@ -0,0 +1,168 @@
#+title: Reinforcement Learning
#+subtitle: Dynamic Programming
#+author: Prof. James Brusey
#+options: toc:nil h:2
#+startup: beamer
#+language: dot
#+latex_header: \usepackage{algorithmicx}
* Policy evaluation (prediction)
** Why study Dynamic Programming?
+ DP assumes a perfect model and is computationally expensive
+ Still important theoretically, though
+ Provides the basis for understanding the methods that follow
** Expectation (reminder)
+ The expected value is the average outcome of a random event over a large number of trials
+ Where all possible values are equally likely, this is simply the mean of the possible values
+ Where each $x_i$ occurs with probability $p(x_i)$, we have the sum
$$
\mathbb{E}\left[ x \right] = \sum_{i}{p({x_i}) x_i}
$$
+ Quick quiz: what is the expected value of a fair, six-sided die?
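** Expectation: a quick check
The quiz above can be checked directly. A minimal Python sketch (illustrative only; the only inputs are the six die faces):

#+begin_src python
# Expected value of a fair, six-sided die: each face 1..6 has p = 1/6.
faces = [1, 2, 3, 4, 5, 6]
p = 1 / 6
expected = sum(p * x for x in faces)
print(expected)  # 3.5
#+end_src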
** Expectation with a subscript?
+ A subscript on the expectation indicates the probability measure it is taken with respect to
$$
\mathbb{E}_y[x] = \sum_i p_y(x_i)x_i
$$
where $p_y$ is a probability measure for $x$ that might differ from the sampling probability.
** Bellman optimality equation
+ Dynamic programming is based on solving Bellman's optimality equation
$$
v_*(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a \right]
$$
+ Why is this equation so important, and what do the variables mean?
** Policy evaluation (prediction)
Evaluation asks:
+ Given a policy $\pi$, what is the expected value $v_\pi$ of each state $s$?
$$
v_\pi(s) \doteq \mathbb{E}_\pi[G_t | S_t=s]
$$
** Policy evaluation (2)
+ We can substitute for $G_t$ using
$$
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
$$
or more simply
$$
G_t = R_{t+1} + \gamma G_{t+1}
$$
** Policy evaluation (3)
We should get
$$
v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[r+\gamma v_\pi(s')\right]
$$
where
+ $p(s',r|s,a)$ is the transition probability
+ $\pi(a|s)$ is the probability of taking action $a$ when in state $s$
+ $r$ is the immediate reward
+ $\gamma$ is the discount factor
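** Bellman expectation backup: sketch
To make the equation concrete, here is a minimal Python sketch of a single backup for one state. The two-state MDP, its action names, and all numbers are hypothetical, chosen only for illustration:

#+begin_src python
# v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
gamma = 0.9
# p[(s, a)] lists (probability, next state, reward) triples
p = {
    (0, 'stay'): [(1.0, 0, 1.0)],
    (0, 'go'):   [(0.8, 1, 0.0), (0.2, 0, 1.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
pi = {0: {'stay': 0.5, 'go': 0.5}, 1: {'stay': 0.5, 'go': 0.5}}
v = {0: 0.0, 1: 0.0}  # current value estimates

def backup(s):
    """One Bellman expectation backup for state s under pi."""
    return sum(pi[s][a] * sum(pr * (r + gamma * v[s2])
                              for pr, s2, r in p[(s, a)])
               for a in pi[s])

print(backup(0))  # 0.6: with v = 0 everywhere, just the expected immediate reward
#+end_src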
** Policy evaluation (4)
For this simple recycle robot, write down the set of equations for $v_\pi(s)$ assuming that $\pi$ is a uniform distribution over actions.

[[file:figures/recycle-robot.pdf]]
# #+begin_src dot :file recycle-robot.png
# digraph {
#   wait [shape=point];
#   high -> wait;
#   wait -> high [label="$$1, r_\mathrm{wait}$$"];
# }
# #+end_src
** Policy evaluation (5)
Here is the equation for $v_\pi(\mathrm{high})$:
\begin{eqnarray*}
v_\pi(\mathrm{high}) & = & \frac{1}{2} (r_\mathrm{wait} + \gamma v_\pi (\mathrm{high})) + \\
&& \frac{1}{2} \Big( \alpha (r_\mathrm{search} + \gamma v_\pi (\mathrm{high})) + \\
&& (1-\alpha) (r_\mathrm{search}+ \gamma v_\pi (\mathrm{low}))\Big)
\end{eqnarray*}
** LP solution
Note that we have equations of the form:
$$
v(a) = k_1 v(a) + k_2 v(b) + k_3
$$
which can be converted into
$$
-k_3 = (k_1 - 1) v(a) + k_2 v(b)
$$
or
$$
\begin{pmatrix}
-k_3 \\
\vdots
\end{pmatrix} = \begin{bmatrix}
(k_1 - 1) & k_2 \\
\vdots & \vdots
\end{bmatrix} \mathbf{v}
$$
This system can be solved by inverting the matrix.
/Do try this at home!/
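** Matrix solution: worked sketch
Taking up the "try this at home" suggestion, here is a pure-Python sketch of the matrix route for a hypothetical pair of evaluation equations (all coefficients invented for illustration; a 2x2 system is solved directly by Cramer's rule rather than a library inverse):

#+begin_src python
# Equations of the form v(a) = k1 v(a) + k2 v(b) + k3 become rows of
# (K - I) v = -k, solved here for a 2x2 case by Cramer's rule.
K = [[0.45, 0.45],   # v(a) = 0.45 v(a) + 0.45 v(b) + 1.0
     [0.00, 0.90]]   # v(b) = 0.90 v(b) + 0.5
k = [1.0, 0.5]

M = [[K[0][0] - 1.0, K[0][1]],
     [K[1][0],       K[1][1] - 1.0]]
rhs = [-k[0], -k[1]]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
v_a = (rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det
v_b = (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det
print(v_a, v_b)  # v(b) = 5.0 exactly; v(a) = 3.25 / 0.55
#+end_src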
** Iterative policy evaluation
[[file:figures/fig4.1a.pdf]]
** Iterative policy evaluation
[[file:figures/fig4.1b.pdf]]

** Approximate methods
+ It is also possible to find the solution /iteratively/: start with some guess for $v$ and repeatedly update it according to the equations.
+ This approach forms the basis of many other methods; for large MDPs, solving by LP is impractical.
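** Iterative evaluation: sketch
The iterative approach can be written in a few lines. The two-state MDP and uniform policy below are hypothetical, for illustration only:

#+begin_src python
# Sweep all states, replacing v(s) by its Bellman expectation backup,
# until the largest change in a sweep falls below theta.
gamma, theta = 0.9, 1e-10
p = {
    (0, 'stay'): [(1.0, 0, 1.0)],
    (0, 'go'):   [(1.0, 1, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
pi = {0: {'stay': 0.5, 'go': 0.5}, 1: {'stay': 0.5, 'go': 0.5}}
v = {0: 0.0, 1: 0.0}
while True:
    delta = 0.0
    for s in v:
        old = v[s]
        v[s] = sum(pi[s][a] * sum(pr * (r + gamma * v[s2])
                                  for pr, s2, r in p[(s, a)])
                   for a in pi[s])
        delta = max(delta, abs(old - v[s]))
    if delta < theta:
        break
print(v)  # converges to v(0) = 7.25, v(1) = 7.75 for these numbers
#+end_src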
* Policy improvement
** Policy improvement
+ Given that we can evaluate any policy, we can /improve/ it by updating the policy to take the action that maximises the resulting value
+ We define the expected value of taking action $a$ and then following $\pi$ from then on:
$$
q_\pi(s, a) \doteq \mathbb E [R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a]
$$
+ which expands to
$$
= \sum_{s',r} p(s',r|s, a) [r + \gamma v_\pi(s')]
$$
** Policy improvement (2)
+ If we can find some $a$ for which $q_\pi (s, a) > v_\pi(s)$, then we can /improve/ $\pi$ by adjusting it to use $a$ in state $s$
+ If no state admits such an improvement, then $\pi$ must be optimal
+ Discuss:
  + What aspects of MDPs are relied on to ensure this is true?
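** Greedy improvement at one state: sketch
The improvement step at a single state is just an argmax over $q_\pi(s,\cdot)$. A Python sketch loosely in the spirit of the recycle robot, but with hypothetical values and probabilities (not the ones from the slides):

#+begin_src python
# q_pi(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma v_pi(s')]; take argmax_a.
gamma = 0.9
v_pi = {'high': 4.0, 'low': 1.0}   # hypothetical current state values
transitions = {                     # outcomes from the state 'high'
    'wait':   [(1.0, 'high', 1.0)],
    'search': [(0.6, 'high', 2.0), (0.4, 'low', 2.0)],
}
q = {a: sum(pr * (r + gamma * v_pi[s2]) for pr, s2, r in outs)
     for a, outs in transitions.items()}
best = max(q, key=q.get)
print(q, best)  # q['wait'] = 4.6 beats q['search'] = 4.52, so 'wait' is greedy here
#+end_src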
* Policy iteration

** Policy iteration algorithm
S1: Initialisation
+ $V(s)$ and $\pi(s)$ initialised arbitrarily for all $s \in S$

S2: Policy evaluation
+ Loop:
  + $\Delta \leftarrow 0$
  + Loop for each $s\in S$:
    + $v \leftarrow V(s)$
    + $V(s) \leftarrow \sum_{s',r} p(s',r|s, \pi(s)) [r + \gamma V(s')]$
    + $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
  + until $\Delta < \theta$

S3: Policy Improvement
+ $\textit{policy-stable} \leftarrow 1$
+ For each $s\in S$:
  + $\textit{old-action}\leftarrow \pi(s)$
  + $\pi(s) \leftarrow \arg\max_a \sum_{s',r} p(s', r|s, a) [r + \gamma V(s')]$
  + If $\textit{old-action}\neq \pi(s)$, then $\textit{policy-stable} \leftarrow 0$
+ If $\textit{policy-stable}$, then stop and return $V\approx v_*$ and $\pi \approx \pi_*$; else go to S2
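** Policy iteration: sketch
The three steps above can be sketched end to end in Python. The two-state MDP below is hypothetical, for illustration only:

#+begin_src python
# Policy iteration: evaluate the current deterministic policy (S2),
# then improve it greedily (S3); stop when the policy is stable.
gamma, theta = 0.9, 1e-10
states, actions = [0, 1], ['stay', 'go']
p = {
    (0, 'stay'): [(1.0, 0, 1.0)],
    (0, 'go'):   [(1.0, 1, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}

def q(s, a, V):
    return sum(pr * (r + gamma * V[s2]) for pr, s2, r in p[(s, a)])

V = {s: 0.0 for s in states}       # S1: initialisation
pi = {s: 'stay' for s in states}
while True:
    while True:                     # S2: policy evaluation
        delta = 0.0
        for s in states:
            old_v = V[s]
            V[s] = q(s, pi[s], V)
            delta = max(delta, abs(old_v - V[s]))
        if delta < theta:
            break
    stable = True                   # S3: policy improvement
    for s in states:
        old_a = pi[s]
        pi[s] = max(actions, key=lambda a: q(s, a, V))
        if pi[s] != old_a:
            stable = False
    if stable:
        break
print(pi, V)  # for these numbers: pi = {0: 'go', 1: 'stay'}, V near {0: 18, 1: 20}
#+end_src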
** Exercise: Policy iteration for action values

Revise the algorithm on the previous page to use action values $q$ instead and thus find $q_*$.
* Value iteration
** Value iteration
+ A problem with policy iteration is the need to perform policy evaluation to some accuracy before improving the policy.
+ Value iteration skips the explicit policy and updates each state's value directly using the maximum over actions
$$
v(s) \leftarrow \max_a \mathbb{E} [R_{t+1} + \gamma v(S_{t+1}) | S_t=s, A_t =a ]
$$
#+beamer: \pause
+ The algorithm terminates when the largest update to any $v(s)$ is smaller than some threshold
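** Value iteration: sketch
The update above can be sketched directly; the same hypothetical two-state MDP style as before:

#+begin_src python
# Value iteration: back up each state straight to the max over actions.
gamma, theta = 0.9, 1e-10
p = {
    (0, 'stay'): [(1.0, 0, 1.0)],
    (0, 'go'):   [(1.0, 1, 0.0)],
    (1, 'stay'): [(1.0, 1, 2.0)],
    (1, 'go'):   [(1.0, 0, 0.0)],
}
V = {0: 0.0, 1: 0.0}
while True:
    delta = 0.0
    for s in V:
        old = V[s]
        V[s] = max(sum(pr * (r + gamma * V[s2]) for pr, s2, r in p[(s, a)])
                   for a in ('stay', 'go'))
        delta = max(delta, abs(old - V[s]))
    if delta < theta:
        break
print(V)  # approaches the optimal values, here V(0) = 18, V(1) = 20
#+end_src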
** Phew!

[[file:figures/happy-homer.png]]
figures/agent-env.pdf (binary, +34.8 KB, not shown)
@@ -0,0 +1,19 @@
\documentclass[tikz,border=10pt]{standalone}
\usetikzlibrary{positioning,shapes,arrows}

\begin{document}
\begin{tikzpicture}[
    agent/.style={rectangle, draw, rounded corners, inner sep=10pt},
    environment/.style={rectangle, draw, rounded corners, inner sep=10pt},
    auto, node distance=1cm, >=latex',
    thickarrow/.style={->, line width=0.8mm, shorten >=3pt, shorten <=3pt, line cap=round},
    thinarrow/.style={->, line width=0.4mm, shorten >=3pt, shorten <=3pt, line cap=round}
  ]
  \node[agent] (a) {Agent};
  \node[environment, below =of a] (e) {Environment};

  \draw [thickarrow] (a) -- node[midway,right] {$A_t$} (e);
  \draw [->, line width=0.8mm] (e.east) to [bend right=45] node[midway,above right] {$S_{t+1}$} (a.east);
  \draw [->, line width=0.4mm] (e.west) to [bend left=45] node[midway,above left] {$R_{t+1}$} (a.west);
\end{tikzpicture}
\end{document}
figures/fig4.1.pdf (binary, +194 KB, not shown)
figures/fig4.1a.pdf (binary, +194 KB, not shown)
figures/fig4.1b.pdf (binary, +194 KB, not shown)
figures/happy-homer.png (binary, +730 KB, not shown)
figures/james_brusey_photo_1w1.2h.png (binary, +8.61 MB, not shown)
figures/maze.pdf (binary, +1.57 KB, not shown)
@@ -0,0 +1,40 @@
\documentclass[tikz,border=10pt]{standalone}
\usetikzlibrary{calc}
\begin{document}
\begin{tikzpicture}
  \def\mazeSize{4}
  \foreach \i in {0,...,\mazeSize}{
    \foreach \j in {0,...,\mazeSize}{
      \node at (\i,\j) {};
    }
  }
  \foreach \i in {0,...,\mazeSize}{
    \draw[line width=1pt, dotted] (\i,0) -- (\i,\mazeSize);
    \draw[line width=1pt, dotted] (0,\i) -- (\mazeSize,\i);
  }

  \draw[line width=1pt] (0,0) -- (0,4) -- (4,4) -- (4,0) -- (0,0);

  \fill[green] (0.5,0.5) circle (0.1);
  \fill[red] (2.5,2.5) circle (0.1);

  \draw[line width=2pt] (1,0) -- (1,1) ;
  \draw[line width=2pt] (2,1) -- (2,2) -- (1,2) -- (1, 3);
  \draw[line width=2pt] (3,2) -- (3,3) -- (2,3);
  %% \draw[line width=2pt] (4,3) -- (4,4) -- (3,4);
  %% \draw[line width=2pt] (5,4) -- (5,5) -- (4,5);
  %% \draw[line width=2pt] (6,5) -- (6,6) -- (5,6);
  %% \draw[line width=2pt] (7,6) -- (7,7) -- (6,7);

  \draw[line width=2pt] (2,0) -- (2,1) -- (3,1) -- (3,2);
  %% \draw[line width=2pt] (4,2) -- (4,3) -- (5,3) -- (5,4);
  %% \draw[line width=2pt] (6,4) -- (6,5) -- (7,5) -- (7,6);

  %% \draw[line width=2pt] (7,7) -- (8,7) -- (8,8);
  %% \draw[line width=2pt] (0,8) -- (1,8) -- (1,7);

  % Specify start and end
  %% \node at (0.5,0.5) {Start};
  %% \node at (2.5,2.5) {Goal};
\end{tikzpicture}
\end{document}
figures/maze1.pdf (binary, +10.9 KB, not shown)
@@ -0,0 +1,40 @@
(contents identical to the preceding TikZ maze file)
figures/recycle-robot.pdf (binary, +1 MB, not shown)