add lab1 and 2
James Brusey committed Jun 5, 2023
1 parent f486634 commit 3fc3ed2b05cf6419c1dc6b551de40bdb4c4b7f0d
Showing 10 changed files with 102 additions and 3 deletions.
BIN +1.72 MB figures/backup-a-s.pdf
BIN +1.72 MB figures/backup-s-a.pdf
BIN +473 KB figures/sarsa.pdf
BIN +63.3 KB figures/sarsa.png
@@ -0,0 +1,46 @@
#+title: Reinforcement Learning
#+subtitle: Frozen Lake Play
#+author: Prof. James Brusey
#+options: toc:nil h:2
#+latex_class:scrartcl
#+latex_header: \usepackage{mathpazo}

The aim of this lab session is to get you up and running with the Farama Foundation's Gymnasium.
If you've previously heard of OpenAI's Gym, Gymnasium is its replacement, maintained after OpenAI dropped support.

CleanRL provides implementations of RL algorithms that can be used in conjunction with Gymnasium.

* Installation

For the first part of the tutorial, you only need to install Gymnasium.

1. Make sure you have Python 3 installed, with a version >= 3.7.1 and < 3.10 (3.10 is not currently supported by CleanRL).

2. Installation documentation for Gymnasium is provided at https://github.com/Farama-Foundation/Gymnasium#installation.

3. You will need Poetry (https://python-poetry.org/docs/) to install CleanRL.

4. You can install CleanRL following the notes at https://github.com/vwxyzjn/cleanrl#get-started.


Documentation is at https://gymnasium.farama.org
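
To check that the installation works, a minimal sketch along the following lines (using the FrozenLake-v1 environment, which also appears in the next lab) should run without errors:

#+BEGIN_SRC python
import gymnasium as gym

# Create the FrozenLake environment (4x4 map by default).
env = gym.make("FrozenLake-v1")

# reset() returns the initial observation and an info dict.
obs, info = env.reset(seed=42)

# Take a few random steps to confirm the API works.
for _ in range(5):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
#+END_SRC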


* Trying things out

** Tabular Q-learning on your own

A good place to start is with this blog post:

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

Note that this post uses a Jupyter notebook, but you are welcome to use plain Python or IPython.
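
Since the post predates Gymnasium, its code follows the older Gym API. As a rough sketch, tabular Q-learning adapted to the Gymnasium API might look like the following (the values of alpha, gamma, and epsilon are illustrative, not tuned):

#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative values

for episode in range(2000):
    s, info = env.reset()
    while True:
        # epsilon-greedy action selection
        if np.random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        sprime, reward, terminated, truncated, info = env.step(a)
        # Q-learning update: bootstrap from the best next-state action
        Q[s, a] += alpha * (reward + gamma * np.max(Q[sprime]) - Q[s, a])
        s = sprime
        if terminated or truncated:
            break
#+END_SRC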

** Using one of the CleanRL algorithms

See the [[https://github.com/vwxyzjn/cleanrl]] page for how to run a pre-written RL algorithm (such as PPO or DQN) on one of the example environments.





@@ -0,0 +1,48 @@
#+title: Reinforcement Learning
#+subtitle: Solving FrozenLake
#+author: Prof. James Brusey
#+options: toc:nil h:1

* Aim

The aim of this lab session is to code a basic algorithm, such as Sarsa or Q-learning, that learns a policy for FrozenLake-v1.

Note that you might need to consult the previous lab sheet to find out how to get started with FrozenLake.
We assume that the 4x4 map is used.

* Sarsa (on-policy TD control)
[[file:figures/sarsa.png]]
* Implement in Python
Aim for a close correspondence between the algorithm and your code.
Note: the policy function is not implemented; you will need to write that yourself.
Also, this code is untested, so check it carefully against the pseudocode.
#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
N_STATES = env.observation_space.n
N_ACTIONS = env.action_space.n
alpha, gamma = 0.1, 0.99  # step size and discount factor (values for you to tune)

Q = np.zeros((N_STATES, N_ACTIONS))

for i in range(1000):
    s, info = env.reset()
    a = policy(s, Q)  # policy() is not provided: write it yourself (e.g. epsilon-greedy)

    while True:
        sprime, reward, terminated, truncated, info = env.step(a)
        aprime = policy(sprime, Q)
        # Sarsa update: bootstrap from the action actually chosen in sprime
        Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
        s, a = sprime, aprime
        if terminated or truncated:
            break

#+END_SRC
* Evaluation
+ A simple method of evaluating the policy learned so far is to keep track of the reward per episode (a sketch follows this list).
+ For some problems, it makes more sense to use the average reward per step.
+ Generally, learning should be turned off and $\varepsilon$ should be set to zero during evaluation.
+ Given that this signal may be noisy, it is recommended to apply some form of smoothing.
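
A minimal sketch of such an evaluation, assuming a Q table learned as above, a greedy policy with learning turned off, and a simple moving average for smoothing:

#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

def evaluate(env, Q, n_episodes=100):
    """Run the greedy policy with learning turned off; return per-episode rewards."""
    returns = []
    for _ in range(n_episodes):
        s, info = env.reset()
        total = 0.0
        while True:
            a = int(np.argmax(Q[s]))  # greedy, i.e. epsilon = 0
            s, reward, terminated, truncated, info = env.step(a)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return returns

def moving_average(x, window=20):
    """Smooth a noisy per-episode reward signal with a simple moving average."""
    return np.convolve(x, np.ones(window) / window, mode="valid")
#+END_SRC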

See also https://github.com/google-research/rliable


* Experiments to try
+ Try varying $\alpha$ and $\varepsilon$.
+ Can you graph episode reward (after, say, 100 episodes) versus $\alpha$? A sketch of such a sweep follows this list.
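
As a starting point, here is a rough, self-contained sketch that wraps the Sarsa loop above in a function (with a simple $\varepsilon$-greedy policy) and sweeps over a few values of $\alpha$; treat the specific values as illustrative:

#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

def sarsa_mean_reward(alpha, epsilon=0.1, gamma=0.99, n_episodes=100):
    """Run the Sarsa loop with an epsilon-greedy policy and
    return the mean reward per episode over training."""
    env = gym.make("FrozenLake-v1")
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)

    def policy(s):
        if rng.random() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    total = 0.0
    for _ in range(n_episodes):
        s, info = env.reset()
        a = policy(s)
        while True:
            sprime, reward, terminated, truncated, info = env.step(a)
            aprime = policy(sprime)
            Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
            s, a = sprime, aprime
            total += reward
            if terminated or truncated:
                break
    return total / n_episodes

# Mean episode reward as a function of alpha: these are the points to graph.
for alpha in (0.01, 0.05, 0.1, 0.3, 0.5):
    print(alpha, sarsa_mean_reward(alpha))
#+END_SRC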
11 mdp.org
@@ -11,6 +11,8 @@
$$
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \ldots
$$
** Backup diagram
file:figures/vpi-backup.pdf
** Some terms
+ Each MDP comprises the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$
+ $\mathcal{S}$ is the set of states
@@ -99,13 +101,16 @@ where $T=\infty$ or $\gamma = 1$ (but not both).
** Value functions
+ A state value function $v_\pi(s)$ is the long-term value of being in state $s$, assuming that you follow policy $\pi$
$$
v_\pi(s) \doteq \mathbb{E}_\pi [G_t | S_t = s]
$$
+ A state-action value function $q_\pi(s,a)$ is the long-term value of being in state $s$, taking action $a$, and then following $\pi$ from then on.
$$
q_\pi(s,a) \doteq \mathbb{E}_\pi [G_t | S_t = s, A_t=a]
$$

** How do we select an action?
[[file:figures/backup-s-a.pdf]]
** What is the consequence of taking that action?
[[file:figures/backup-a-s.pdf]]
* Optimal policies and optimal value functions
** Optimal policies and value functions
+ Given some MDP, what is the best value we can achieve?
BIN +501 KB (380%) mdp.pdf
BIN -15.1 KB (89%) monte.pdf
