Showing 10 changed files with 102 additions and 3 deletions.
Binary files added:
+ figures/backup-a-s.pdf (+1.72 MB)
+ figures/backup-s-a.pdf (+1.72 MB)
+ figures/sarsa.pdf (+473 KB)
+ figures/sarsa.png (+63.3 KB)
+ figures/vpi-backup.pdf (+787 KB)
#+title: Reinforcement Learning
#+subtitle: Frozen Lake Play
#+author: Prof. James Brusey
#+options: toc:nil h:2
#+latex_class:scrartcl
#+latex_header: \usepackage{mathpazo}

The aim of this lab session is to get you up and running with the Farama Foundation's Gymnasium.
If you have previously heard of OpenAI's Gym, Gymnasium is its maintained replacement, created after OpenAI dropped support.

CleanRL provides implementations of RL algorithms that can be used in conjunction with Gymnasium.
* Installation

For the first part of the tutorial, you only need to install Gymnasium.

1. Make sure you have a version of Python 3 installed that is >=3.7.1 and <3.10. Note that 3.10 is not currently supported by CleanRL.

2. Installation documentation for Gymnasium is provided at https://github.com/Farama-Foundation/Gymnasium#installation.

3. You will need Poetry (https://python-poetry.org/docs/) for CleanRL.

4. You can install CleanRL by following the notes at https://github.com/vwxyzjn/cleanrl#get-started.

Documentation for Gymnasium is at https://gymnasium.farama.org
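To check that Gymnasium is installed correctly, you can try a minimal random-agent rollout like the following (an untested sketch; the =render_mode= argument is optional):

#+BEGIN_SRC python
import gymnasium as gym

# create the environment; "ansi" gives a text rendering of the grid
env = gym.make("FrozenLake-v1", render_mode="ansi")
obs, info = env.reset()

while True:
    action = env.action_space.sample()  # choose a random action
    obs, reward, terminated, truncated, info = env.step(action)
    print(env.render())
    if terminated or truncated:
        break
env.close()
#+END_SRC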
* Trying things out

** Tabular Q-learning on your own

A good place to start is with this blog post:

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

Note that this post uses a Jupyter notebook, but you are welcome to use python or ipython.
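If you want a skeleton to start from, here is an untested sketch of tabular Q-learning on FrozenLake-v1; the $\varepsilon$-greedy action selection and the hyperparameter values are assumptions that you should tune:

#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative values only

for episode in range(1000):
    s, info = env.reset()
    while True:
        # epsilon-greedy: explore with probability epsilon
        if np.random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        sprime, reward, terminated, truncated, info = env.step(a)
        # Q-learning update: bootstrap from the greedy value of sprime
        Q[s, a] += alpha * (reward + gamma * np.max(Q[sprime]) - Q[s, a])
        s = sprime
        if terminated or truncated:
            break
#+END_SRC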
** Using one of the CleanRL algorithms

See the [[https://github.com/vwxyzjn/cleanrl]] page for how to run a pre-written RL algorithm (such as PPO or DQN) on one of the example environments.
#+title: Reinforcement Learning
#+subtitle: Solving FrozenLake
#+author: Prof. James Brusey
#+options: toc:nil h:1

* Aim

The aim of this lab session is to code a basic algorithm, such as Sarsa or Q-learning, to learn a policy for FrozenLake-v1.

Note that you might need to consult the previous lab sheet to find out how to get started with FrozenLake.
We assume that the 4x4 map is used.
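Assuming the default environment settings, the 4x4 map can be requested explicitly when the environment is created:

#+BEGIN_SRC python
import gymnasium as gym

# is_slippery=True is the default (stochastic) dynamics
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
#+END_SRC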
* Sarsa (on-policy TD control)
[[file:figures/sarsa.png]]
* Implement in python
Try to have a close correspondence between the algorithm and your code.
Note: I have not implemented =policy=---you'll need to do that (an $\varepsilon$-greedy choice is typical).
Also, this code is untested---it may not work!
Check it carefully against the pseudo-code.
#+BEGIN_SRC python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4")
N_STATES = env.observation_space.n
N_ACTIONS = env.action_space.n
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma = 0.1, 0.99  # step size and discount; tune these

for i in range(1000):
    s, info = env.reset()
    a = policy(s, Q)  # policy is yours to write (e.g., epsilon-greedy)

    while True:
        sprime, reward, terminated, truncated, info = env.step(a)
        aprime = policy(sprime, Q)
        # Sarsa update: bootstrap from the action actually selected next
        Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
        s, a = sprime, aprime
        if terminated or truncated:
            break
#+END_SRC
* Evaluation
+ A simple method of evaluating the policy so far is to keep track of the reward per episode.
+ For some problems, it makes more sense to use the average reward per step.
+ Generally, learning should be turned off and $\varepsilon$ should be zero during evaluation (see the sketch below).
+ Given that this signal may be noisy, it is recommended to apply some form of smoothing.
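A minimal sketch of such an evaluation loop, reusing the Q table from the Sarsa code above (the function name and episode count are illustrative):

#+BEGIN_SRC python
import numpy as np

def evaluate(env, Q, n_episodes=100):
    """Run the greedy policy with learning off and epsilon set to zero."""
    returns = []
    for _ in range(n_episodes):
        s, info = env.reset()
        total = 0.0
        while True:
            a = int(np.argmax(Q[s]))  # greedy action, no exploration
            s, reward, terminated, truncated, info = env.step(a)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return np.mean(returns)
#+END_SRC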
See also https://github.com/google-research/rliable
* Experiments to try
+ Try varying $\alpha$ and $\varepsilon$.
+ Can you graph episode reward (after, say, 100 episodes) versus $\alpha$? A sketch of such a sweep follows.
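One possible shape for that experiment, assuming a =train= helper that wraps the Sarsa loop above and the =evaluate= function from the previous section (both names are illustrative):

#+BEGIN_SRC python
import matplotlib.pyplot as plt

alphas = [0.01, 0.05, 0.1, 0.2, 0.5]
scores = []
for alpha in alphas:
    Q = train(alpha=alpha, n_episodes=100)  # your Sarsa loop, as a function
    scores.append(evaluate(env, Q))

# plot mean episode reward against the step size
plt.plot(alphas, scores, marker="o")
plt.xlabel(r"$\alpha$")
plt.ylabel("mean episode reward")
plt.show()
#+END_SRC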