Reinforcement Learning
Aim
The aim of this lab session is to code a basic temporal-difference algorithm, such as Sarsa or Q-learning, to learn a policy for FrozenLake-v1.
Note that you might need to consult the previous lab sheet to remind yourself how to get started with FrozenLake. We assume the 4x4 map is used.
Sarsa (on-policy TD control)
Implement in Python
Try to keep a close correspondence between the algorithm and your code. Note: the policy function is not implemented below; you will need to write it yourself. Also, this code is untested and may not work as given, so check it carefully against the pseudocode.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")          # 4x4 map by default
N_STATES = env.observation_space.n
N_ACTIONS = env.action_space.n
alpha, gamma = 0.1, 0.99                 # step size and discount factor

Q = np.zeros((N_STATES, N_ACTIONS))
for i in range(1000):
    s, info = env.reset()
    a = policy(s, Q)                     # policy is NOT implemented here
    while True:
        sprime, reward, terminated, truncated, info = env.step(a)
        aprime = policy(sprime, Q)
        # Sarsa update: Q(s,a) += alpha * [r + gamma*Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
        s, a = sprime, aprime
        if terminated or truncated:
            break
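The `policy` function is left for you to write. As one possible sketch (not the only answer, and the `epsilon` and `rng` parameters are choices of this example), an $ε$-greedy policy picks a random action with probability $ε$ and otherwise the greedy action:

```python
import numpy as np

def policy(s, Q, epsilon=0.1, rng=None):
    """Epsilon-greedy action selection: explore with probability epsilon,
    otherwise pick the action with the highest Q-value in state s."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform random action
    return int(np.argmax(Q[s]))                # exploit: greedy action
```

With `epsilon=0` this reduces to the purely greedy policy, which is what you want once learning is turned off for evaluation.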
Evaluation
- A simple method of evaluating the policy so far is to keep track of the reward per episode.
- For some problems, it makes more sense to use the average reward per step.
- Generally, learning should be turned off and $ε$ should be zero during evaluation.
- Given that this signal may be noisy, it is recommended to apply some form of smoothing.
See also https://github.com/google-research/rliable
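As a sketch of the smoothing suggested above (assuming you collect the per-episode rewards in a list), a simple moving average can be computed with NumPy:

```python
import numpy as np

def smooth(rewards, window=100):
    """Simple moving average of per-episode rewards over `window` episodes.
    Returns an array shorter than the input by window - 1."""
    rewards = np.asarray(rewards, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```

The window size is a judgement call: larger windows give smoother curves but hide short-term changes in performance.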
Experiments to try
- Try varying $α$ and $ε$.
- Can you graph episode reward (after say 100 episodes) versus $α$?
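One way to organise the $α$ sweep is a small helper that runs your training loop once per value and summarises each run. Here `train` is a stand-in for whatever function wraps your Sarsa loop; it is assumed to take `(alpha, n_episodes)` and return a list of per-episode rewards:

```python
import numpy as np

def sweep_alpha(train, alphas, n_episodes=100, last_n=20):
    """For each alpha, run training and record the mean reward over the
    final `last_n` episodes. `train(alpha, n_episodes)` is assumed to
    return a sequence of per-episode rewards."""
    return [float(np.mean(train(a, n_episodes)[-last_n:])) for a in alphas]
```

You can then plot the result, e.g. `plt.plot(alphas, sweep_alpha(train, alphas), marker="o")` with matplotlib. Since single runs are noisy, averaging several runs per $α$ value gives a more trustworthy curve.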