add lab1 and 2
James Brusey committed Jun 5, 2023
1 parent f486634 commit 3fc3ed2b05cf6419c1dc6b551de40bdb4c4b7f0d
Showing 10 changed files with 102 additions and 3 deletions.
BIN +1.72 MB figures/backup-a-s.pdf
BIN +1.72 MB figures/backup-s-a.pdf
BIN +473 KB figures/sarsa.pdf
BIN +63.3 KB figures/sarsa.png
@@ -0,0 +1,46 @@
#+title: Reinforcement Learning
#+subtitle: Frozen Lake Play
#+author: Prof. James Brusey
#+options: toc:nil h:2
#+latex_class:scrartcl
#+latex_header: \usepackage{mathpazo}

The aim of this lab session is to get you up and running with the Farama Foundation's Gymnasium.
If you've previously heard of OpenAI's Gym, Gymnasium is its replacement, maintained after OpenAI dropped support.

CleanRL provides implementations of RL algorithms that can be used in conjunction with Gymnasium.

* Installation

For the first part of the tutorial, you only need to install Gymnasium.

1. Make sure you have Python 3 installed, with a version >= 3.7.1 and < 3.10 (3.10 is not currently supported by CleanRL).

2. Installation documentation for Gymnasium is provided at https://github.com/Farama-Foundation/Gymnasium#installation.

3. You will need Poetry (https://python-poetry.org/docs/) to install CleanRL.

4. You can install CleanRL following the notes at https://github.com/vwxyzjn/cleanrl#get-started.


Documentation is at https://gymnasium.farama.org
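
To check that the installation works, a minimal sketch along the following lines (using the FrozenLake-v1 environment, which also appears in the next lab) should run without errors:

#+BEGIN_SRC python
import gymnasium as gym

# Create the FrozenLake environment (4x4 map by default).
env = gym.make("FrozenLake-v1")

# reset() returns the initial observation and an info dict.
obs, info = env.reset(seed=42)

# Take a few random steps to confirm the API works.
for _ in range(5):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
#+END_SRC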


* Trying things out

** Tabular Q-learning on your own

A good place to start is with this blog post:

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

Note that this post uses a Jupyter notebook, but you are welcome to use plain Python or IPython.
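
Since the post predates Gymnasium, its code follows the older Gym API. As a rough sketch, tabular Q-learning adapted to the Gymnasium API might look like the following (the values of alpha, gamma, and epsilon are illustrative, not tuned):

#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative values

for episode in range(2000):
    s, info = env.reset()
    while True:
        # epsilon-greedy action selection
        if np.random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        sprime, reward, terminated, truncated, info = env.step(a)
        # Q-learning update: bootstrap from the best next-state action
        Q[s, a] += alpha * (reward + gamma * np.max(Q[sprime]) - Q[s, a])
        s = sprime
        if terminated or truncated:
            break
#+END_SRC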

** Using one of the CleanRL algorithms

See the [[https://github.com/vwxyzjn/cleanrl]] page for how to run a pre-written RL algorithm (such as PPO or DQN) on one of the example environments.





@@ -0,0 +1,48 @@
#+title: Reinforcement Learning
#+subtitle: Solving FrozenLake
#+author: Prof. James Brusey
#+options: toc:nil h:1

* Aim

The aim of this lab session is to code a basic algorithm, such as Sarsa or Q-learning, that learns a policy for FrozenLake-v1.

Note that you might need to consult the previous lab sheet to find out how to get started with FrozenLake.
We assume that the 4x4 map is used.

* Sarsa (on-policy TD control)
[[file:figures/sarsa.png]]
* Implement in Python
Aim for a close correspondence between the algorithm and your code.
Note: the policy function is not implemented; you will need to write that yourself.
Also, this code is untested, so check it carefully against the pseudocode.
#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
N_STATES = env.observation_space.n
N_ACTIONS = env.action_space.n
alpha, gamma = 0.1, 0.99  # step size and discount factor (values for you to tune)

Q = np.zeros((N_STATES, N_ACTIONS))

for i in range(1000):
    s, info = env.reset()
    a = policy(s, Q)  # policy() is not provided: write it yourself (e.g. epsilon-greedy)

    while True:
        sprime, reward, terminated, truncated, info = env.step(a)
        aprime = policy(sprime, Q)
        # Sarsa update: bootstrap from the action actually chosen in sprime
        Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
        s, a = sprime, aprime
        if terminated or truncated:
            break

#+END_SRC
* Evaluation
+ A simple method of evaluating the policy learned so far is to keep track of the reward per episode (a sketch follows this list).
+ For some problems, it makes more sense to use the average reward per step.
+ Generally, learning should be turned off and $\varepsilon$ should be set to zero during evaluation.
+ Given that this signal may be noisy, it is recommended to apply some form of smoothing.
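
A minimal sketch of such an evaluation, assuming a Q table learned as above, a greedy policy with learning turned off, and a simple moving average for smoothing:

#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

def evaluate(env, Q, n_episodes=100):
    """Run the greedy policy with learning turned off; return per-episode rewards."""
    returns = []
    for _ in range(n_episodes):
        s, info = env.reset()
        total = 0.0
        while True:
            a = int(np.argmax(Q[s]))  # greedy, i.e. epsilon = 0
            s, reward, terminated, truncated, info = env.step(a)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return returns

def moving_average(x, window=20):
    """Smooth a noisy per-episode reward signal with a simple moving average."""
    return np.convolve(x, np.ones(window) / window, mode="valid")
#+END_SRC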

See also https://github.com/google-research/rliable


* Experiments to try
+ Try varying $\alpha$ and $\varepsilon$.
+ Can you graph episode reward (after, say, 100 episodes) versus $\alpha$? A sketch of such a sweep follows this list.
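
As a starting point, here is a rough, self-contained sketch that wraps the Sarsa loop above in a function (with a simple $\varepsilon$-greedy policy) and sweeps over a few values of $\alpha$; treat the specific values as illustrative:

#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

def sarsa_mean_reward(alpha, epsilon=0.1, gamma=0.99, n_episodes=100):
    """Run the Sarsa loop with an epsilon-greedy policy and
    return the mean reward per episode over training."""
    env = gym.make("FrozenLake-v1")
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)

    def policy(s):
        if rng.random() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    total = 0.0
    for _ in range(n_episodes):
        s, info = env.reset()
        a = policy(s)
        while True:
            sprime, reward, terminated, truncated, info = env.step(a)
            aprime = policy(sprime)
            Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
            s, a = sprime, aprime
            total += reward
            if terminated or truncated:
                break
    return total / n_episodes

# Mean episode reward as a function of alpha: these are the points to graph.
for alpha in (0.01, 0.05, 0.1, 0.3, 0.5):
    print(alpha, sarsa_mean_reward(alpha))
#+END_SRC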
11 mdp.org
@@ -11,6 +11,8 @@
$$
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots
$$
** Backup diagram
file:figures/vpi-backup.pdf
** Some terms
+ Each MDP comprises the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$
+ $\mathcal{S}$ is the set of states
@@ -99,13 +101,16 @@ where $T=\infty$ or $\gamma = 1$ (but not both).
** Value functions
+ A state value function $v_\pi(s)$ is the long-term value of being in state $s$, assuming that you follow policy $\pi$
$$
v_\pi(s) \doteq \mathbb{E}_\pi [G_t | S_t = s]
$$
+ A state-action value function $q_\pi(s,a)$ is the long-term value of being in state $s$, taking action $a$, and then following $\pi$ from then on.
$$
q_\pi(s,a) \doteq \mathbb{E}_\pi [G_t | S_t = s, A_t=a]
$$

** How do we select an action?
[[file:figures/backup-s-a.pdf]]
** What is the consequence of taking that action?
[[file:figures/backup-a-s.pdf]]
* Optimal policies and optimal value functions
** Optimal policies and value functions
+ Given some MDP, what is the best value we can achieve?
BIN +501 KB (380%) mdp.pdf
BIN -15.1 KB (89%) monte.pdf
