Reinforcement Learning
Aim
The aim of this lab session is to code a basic temporal-difference algorithm, such as Sarsa or Q-learning, to learn a policy for FrozenLake-v1.
Note that you might need to consult the previous lab sheet to remind yourself how to get started with FrozenLake. We assume the 4x4 map is used.
Sarsa (on-policy TD control)
Implement in Python
Try to keep a close correspondence between the algorithm and your code. Note: the policy function is not implemented below; you will need to write it yourself. Also, this code is untested and may not work as given, so check it carefully against the pseudocode.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")          # 4x4 map by default
N_STATES = env.observation_space.n
N_ACTIONS = env.action_space.n
alpha, gamma = 0.1, 0.99                 # step size and discount factor

Q = np.zeros((N_STATES, N_ACTIONS))
for i in range(1000):
    s, info = env.reset()
    a = policy(s, Q)                     # policy is NOT implemented here
    while True:
        sprime, reward, terminated, truncated, info = env.step(a)
        aprime = policy(sprime, Q)
        # Sarsa update: Q(s,a) += alpha * [r + gamma*Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (reward + gamma * Q[sprime, aprime] - Q[s, a])
        s, a = sprime, aprime
        if terminated or truncated:
            break
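The `policy` function is left for you to write. As one possible sketch (not the only answer, and the `epsilon` and `rng` parameters are choices of this example), an $ε$-greedy policy picks a random action with probability $ε$ and otherwise the greedy action:

```python
import numpy as np

def policy(s, Q, epsilon=0.1, rng=None):
    """Epsilon-greedy action selection: explore with probability epsilon,
    otherwise pick the action with the highest Q-value in state s."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform random action
    return int(np.argmax(Q[s]))                # exploit: greedy action
```

With `epsilon=0` this reduces to the purely greedy policy, which is what you want once learning is turned off for evaluation.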
Evaluation
- A simple method of evaluating the policy so far is to keep track of the reward per episode.
- For some problems, it makes more sense to use the average reward per step.
- Generally, learning should be turned off and $ε$ should be zero during evaluation.
- Given that this signal may be noisy, it is recommended to apply some form of smoothing.
See also https://github.com/google-research/rliable
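As a sketch of the smoothing suggested above (assuming you collect the per-episode rewards in a list), a simple moving average can be computed with NumPy:

```python
import numpy as np

def smooth(rewards, window=100):
    """Simple moving average of per-episode rewards over `window` episodes.
    Returns an array shorter than the input by window - 1."""
    rewards = np.asarray(rewards, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```

The window size is a judgement call: larger windows give smoother curves but hide short-term changes in performance.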
Experiments to try
- Try varying $α$ and $ε$.
- Can you graph episode reward (after say 100 episodes) versus $α$?
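One way to organise the $α$ sweep is a small helper that runs your training loop once per value and summarises each run. Here `train` is a stand-in for whatever function wraps your Sarsa loop; it is assumed to take `(alpha, n_episodes)` and return a list of per-episode rewards:

```python
import numpy as np

def sweep_alpha(train, alphas, n_episodes=100, last_n=20):
    """For each alpha, run training and record the mean reward over the
    final `last_n` episodes. `train(alpha, n_episodes)` is assumed to
    return a sequence of per-episode rewards."""
    return [float(np.mean(train(a, n_episodes)[-last_n:])) for a in alphas]
```

You can then plot the result, e.g. `plt.plot(alphas, sweep_alpha(train, alphas), marker="o")` with matplotlib. Since single runs are noisy, averaging several runs per $α$ value gives a more trustworthy curve.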